Preface to the First Edition


System identification is a diverse field that can be presented in many different ways. 
The subtitle, Theory for the User, reflects the attitude of the present treatment. Yes,
the book is about theory, but the focus is on theory that has direct consequences
for the understanding and practical use of available techniques. My goal has been
to give the reader a firm grip on basic principles so that he or she can confidently
approach a practical problem, as well as the rich and sometimes confusing literature
on the subject. 

Stressing the utilitarian aspect of theory should not, I believe, be taken as an
excuse for sloppy mathematics. Therefore, I have tried to develop the theory without
cheating. The more technical parts have, however, been placed in appendixes or
in asterisk-marked sections, so that the reluctant reader does not have to stumble
through them. In fact, it is a redeeming feature of life that we are able to use many 
things without understanding every detail of them. This is true also of the theory of 
system identification. The practitioner who is looking for some quick advice should 
thus be able to proceed rapidly to Part III (User’s Choices) by hopping through the 
summary sections of the earlier chapters. 

The core material of the book should be suitable for a graduate-level course
in system identification. As a prerequisite for such a course, it is natural, although
not absolutely necessary, to require that the student should be somewhat familiar
with dynamical systems and stochastic signals. The manuscript has been used as
a text for system identification courses at Stanford University, the Massachusetts
Institute of Technology, Yale University, the Australian National University and the
Universities of Lund and Linköping. Course outlines, as well as a solutions manual
for the problems, are available from the publisher.

The existing literature on system identification is indeed extensive and virtually 
impossible to cover in a bibliography. In this book I have tried to concentrate on 
recent and easily available references that I think are suitable for further study, as 
well as on some earlier works that reflect the roots of various techniques and results. 
Clearly, many other relevant references have been omitted. 

Some portions of the book contain material that is directed more toward the 
serious student of identification theory than to the user. These portions are put 
either in appendixes or in sections and subsections marked with an asterisk (*). 
While occasional references to this material may be encountered, it is safe to regard 
it as optional reading; the continuity will not be impaired if it is skipped. 

The problem sections for each chapter have been organized into four groups 
of different problem types: 


• G problems: These could be of General interest and it may be worthwhile to
browse through them, even without intending to solve them.

• E problems: These are regular pencil-and-paper Exercises to check the basic
techniques of the chapter.

• T problems: These are Theoretically oriented problems and typically more
difficult than the E problems.

• D problems: In these problems the reader is asked to fill in technical Details
that were glossed over in the text.
Operators and Notational Conventions


arg(z) = argument of the complex number z

arg min f(x) = value of x that minimizes f(x)

$x_N \in \mathrm{As}F(n, m)$: sequence of random variables $x_N$ converges in distribution to
the F-distribution with n and m degrees of freedom

$x_N \in \mathrm{As}N(m, P)$: sequence of random variables $x_N$ converges in distribution to
the normal distribution with mean m and covariance matrix P; see (1.17)

$x_N \in \mathrm{As}\chi^2(n)$: sequence of random variables $x_N$ converges in distribution to
the $\chi^2$ distribution with n degrees of freedom

Cov x = covariance matrix of the random vector x; see (1.4)

det A = determinant of the matrix A

dim $\theta$ = dimension (number of rows) of the column vector $\theta$

Ex = mathematical expectation of the random vector x; see (1.3)

$\bar{E}x(t) = \lim_{N\to\infty} \frac{1}{N}\sum_{t=1}^{N} Ex(t)$; see (2.60)

O(x) = ordo x: function tending to zero at the same rate as x

o(x) = small ordo x: function tending to zero faster than x

$x \in N(m, P)$: random variable x is normally distributed with mean m and
covariance matrix P; see (1.6)

Re z = real part of the complex number z

$\mathcal{R}(f)$ = range of the function f = the set of values that f(x) may assume

$\mathbf{R}^d$ = Euclidean d-dimensional space

x = sol{f(x) = 0}: x is the solution (or set of solutions) to the equation
f(x) = 0

tr(A) = trace (the sum of the diagonal elements) of the matrix A

Var x = variance of the random variable x

$A^{-1}$ = inverse of the matrix A

$A^T$ = transpose of the matrix A

$A^{-T}$ = transpose of the inverse of the matrix A

$\bar{z}$ = complex conjugate of the complex number z

(superscript * is not used to denote transpose and complex conjugate; it is used
only as a distinguishing superscript)

$y_s^t = \{y(s), y(s+1), \ldots, y(t)\}$


$y^t = \{y(1), y(2), \ldots, y(t)\}$

$U_N(\omega)$ = Fourier transform of $u^N$; see (2.37)

$R_v(\tau) = \bar{E}v(t)v^T(t - \tau)$; see (2.61)

$R_{sw}(\tau) = \bar{E}s(t)w^T(t - \tau)$; see (2.62)

$\Phi_v(\omega)$ = spectrum of v = Fourier transform of $R_v(\tau)$; see (2.63)

$\Phi_{sw}(\omega)$ = cross spectrum between s and w = Fourier transform of $R_{sw}(\tau)$; see
(2.64)

$\hat{R}_s^N(\tau) = \frac{1}{N}\sum_{t=1}^{N} s(t)s^T(t - \tau)$; see (6.10)

$\hat{\Phi}_u^N(\omega)$ = estimate of the spectrum of u based on $u^N$; see (6.48)

$\hat{v}(t|t - 1)$ = prediction of v(t) based on $v^{t-1}$

$\frac{d}{d\theta}V(\theta)$ = gradient of $V(\theta)$ with respect to $\theta$; a column vector of dimension
dim $\theta$ if V is scalar valued

$V'(\theta)$ = gradient of V with respect to its argument

$\ell'_\varepsilon(\varepsilon, \theta)$ = partial derivative of $\ell$ with respect to $\varepsilon$

$\delta_{ij}$ = Kronecker's delta: zero unless i = j

$\delta(k) = \delta_{k0}$

$B(\theta_0, \varepsilon)$ = $\varepsilon$-neighborhood of $\theta_0$: $\{\theta : |\theta - \theta_0| < \varepsilon\}$

$\triangleq$ = the left side is defined by the right side

$|\cdot|$ = (Euclidean) norm of a vector

$\|\cdot\|$ = (Frobenius) norm of a matrix (see 2.91)


SYMBOLS USED IN TEXT 


This list contains symbols that have some global use. Some of the symbols may have 
another local meaning. 

$D_{\mathcal{M}}$ = set of values over which $\theta$ ranges in a model structure. See (4.122)

$D_c$ = set into which the $\theta$-estimate converges. See (8.23)

e(t) = disturbance at time t; usually {e(t), t = 1, 2, ...} is white noise (a sequence
of independent random variables with zero mean values and variance $\lambda$)

$e_0(t)$ = “true” driving disturbance acting on a given system S; see (8.2)

$f_e(x)$, $f_e(x, \theta)$ = probability density function of the random variable e; see (1.2)
and (4.4)

G(q) = transfer function from u to y; see (2.20)

$G(q, \theta)$ = transfer function in a model structure, corresponding to the parameter
value $\theta$; see (4.4)

$G_0(q)$ = “true” transfer function from u to y for a given system; see (8.7)

$\hat{G}_N(q)$ = estimate of G(q) based on $Z^N$

$G^*(q)$ = limiting estimate of G(q); see (8.71)

$\tilde{G}_N(q)$ = difference $\hat{G}_N(q) - G_0(q)$; see (8.15)

$\mathcal{G}$ = set of transfer functions obtained in a given structure; see (8.44)

$H(q)$, $H(q, \theta)$, $H_0(q)$, $\hat{H}_N(q)$, $H^*(q)$, $\tilde{H}_N(q)$, $\mathcal{H}$: analogous to G but for the
transfer function from e to y

L(q) = prefilter for the prediction errors; see (7.10)

$\ell(\varepsilon)$, $\ell(\varepsilon, \theta)$, $\ell(\varepsilon, t, \theta)$ = norm for the prediction errors used in the criterion; see
(7.11), (7.16), (7.18)

$\mathcal{M}$ = model structure (a mapping from a parameter space to a set of models); see
(4.122)

$\mathcal{M}(\theta)$ = particular model corresponding to the parameter value $\theta$; see (4.122)

$\mathcal{M}^*$ = set of models (usually generated as the range of a model structure); see
(4.118)

$P_\theta$ = asymptotic covariance matrix of $\theta$; see (9.11)

$q$, $q^{-1}$ = forward and backward shift operators; see (2.15)

S = “the true system”; see (8.7)

T(q) = [G(q) H(q)]; see (4.109)

$T(q, \theta)$, $T_0(q)$, $\hat{T}_N(q)$, $\tilde{T}_N(q)$ = analogous to G and H

u(t) = input variable at time t

$V_N(\theta, Z^N)$ = criterion function to be minimized; see (7.11)

$\bar{V}(\theta)$ = limit of criterion function; see (8.28)

v(t) = disturbance variable at time t

w(t) = usually a disturbance variable at time t; the precise meaning varies with
the local context

x(t) = state vector at time t; dimension = n

y(t) = output variable at time t

$\hat{y}(t|\theta)$ = predicted output at time t using a model $\mathcal{M}(\theta)$ and based on $Z^{t-1}$; see
(4.6)

$z(t) = [y(t) \;\; u(t)]^T$; see (4.113)

$Z^N = \{u(0), y(0), \ldots, u(N), y(N)\}$

$\varepsilon(t, \theta)$ = prediction error $y(t) - \hat{y}(t|\theta)$

$\lambda$ = used to denote variance; also, in Chapter 11, the forgetting factor; see (11.6),
(11.63)

$\theta$ = vector used to parametrize models; dimension = d; see (4.4), (4.5), (5.66)

$\hat{\theta}_N$, $\theta_0$, $\theta^*$, $\tilde{\theta}_N$ = analogous to G

$\varphi(t)$ = regression vector at time t; see (4.11) and (5.67)

$\chi_0(t) = [u(t) \;\; e_0(t)]^T$; see (8.14)

$\psi(t, \theta)$ = gradient of $\hat{y}(t|\theta)$ with respect to $\theta$; a d-dimensional column vector;
see (4.121c)

$\zeta(t)$, $\zeta(t, \theta)$ = “the correlation vector” (instruments); see (7.110)

$T'(q, \theta)$ = gradient of $T(q, \theta)$ with respect to $\theta$ (a d × 2 matrix); see (4.125)
ABBREVIATIONS AND ACRONYMS


ARARX: See Table 4.1 

ARMA: AutoRegressive Moving Average (see Table 4.1) 
ARMAX: AutoRegressive Moving Average with eXternal input (see Table 4.1) 
ARX: AutoRegressive with eXternal input (see Table 4.1) 
BJ: Box-Jenkins model structure (see Table 4.1) 

ETFE: Empirical Transfer Function Estimate; see (6.24) 
FIR: Finite Impulse Response model (see Table 4.1) 

IV: Instrumental variables (see Section 7.6) 

LS: Least Squares (see Section 7.3) 

ML: Maximum Likelihood (see Section 7.4) 

MSE: Mean Square Error 

OE: Output error model structure (see Table 4.1) 

PDF: Probability Density Function 

PEM: Prediction-Error Method (see Section 7.2) 

PLR: PseudoLinear Regression (see Section 7.5) 

RIV: Recursive IV (see Section 11.3) 

RLS: Recursive LS (see Section 11.2) 

RPEM: Recursive PEM (see Section 11.4) 

RPLR: Recursive PLR (see Section 11.5) 

SISO: Single Input Single Output 

w.p.: with probability

w.p. 1: with probability one; see (1.15) 

w.r.t.: with respect to 


INTRODUCTION 


Inferring models from observations and studying their properties is really what science
is about. The models (“hypotheses,” “laws of nature,” “paradigms,” etc.) may
be of more or less formal character, but they have the basic feature that they attempt
to link observations together into some pattern. System identification deals
with the problem of building mathematical models of dynamical systems based on
observed data from the system. The subject is thus part of basic scientific methodology,
and since dynamical systems are abundant in our environment, the techniques
of system identification have a wide application area. This book aims at giving an
understanding of available system identification methods, their rationale, properties,
and use.


1.1 DYNAMIC SYSTEMS 


In loose terms a system is an object in which variables of different kinds interact 
and produce observable signals. The observable signals that are of interest to us are 
usually called outputs. The system is also affected by external stimuli. External 
signals that can be manipulated by the observer are called inputs. Others are 
called disturbances and can be divided into those that are directly measured and 
those that are only observed through their influence on the output. The distinction 
between inputs and measured disturbances is often less important for the modeling 
process. See Figure 1.1. Clearly the notion of a system is a broad concept, and it is 
not surprising that it plays an important role in modern science. Many problems in 
various fields are solved in a system-oriented framework. Instead of attempting a 
formal definition of the system concept, we shall illustrate it by a few examples. 
Figure 1.1 A system with output y, input u, measured disturbance w, and
unmeasured disturbance v.


Example 1.1 A Solar-Heated House 


Consider the solar-heated house depicted in Figure 1.2. The system operates in such 
a way that the sun heats the air in the solar panel. The air is then pumped into a heat 
storage, which is a box filled with pebbles. The stored energy can later be transferred 
to the house. We are interested in how solar radiation and pump velocity affect the 
temperature in the heat storage. This system is symbolically depicted in Figure 1.3. 
Figure 1.4 shows a record of observed data over a 50-hour period. The variables 
were sampled every 10 minutes. □
Figure 1.2 A solar-heated house.


Example 1.2 A Military Aircraft 


For the development of an aircraft, a substantial amount of work is allocated to constructing
a mathematical model of its dynamic behavior. This is required for the
simulators, for the synthesis of autopilots, and for the analysis of its properties. Substantial
physical insight is utilized, as well as wind tunnel experiments, in the course
of this work, and a most important source of information comes from the test flights.


Figure 1.3 The solar-heated house system: u: input; I: measured disturbance; y:
output; v: unmeasured disturbances. [Block diagram labels: v: wind, outdoor
temperature, etc.; I: solar radiation; y: storage temperature; u: pump velocity.]
Figure 1.4 Storage temperature y, pump velocity u, and solar intensity I over a
50-hour period. Sampling interval: 10 minutes.

Figure 1.5 The Swedish fighter aircraft JAS-Gripen.


Figure 1.5 shows the Swedish aircraft JAS-Gripen, developed by SAAB AB,
Sweden, and Figure 1.6 shows some results from test flights. Such data can be
used to build a model of the pitch channel, i.e., how the pitch rate is affected
by the three control signals: elevator, canard, and leading edge flap. The elevator
in this case corresponds to aileron combinations at the back of the wings,
while separate action is achieved from the ailerons at the leading edge (the front
of the wings). The canards are a separate set of rudders at the front of the wings.
Figure 1.6 Results from test flights of the Swedish aircraft JAS-Gripen, developed by
SAAB AB, Sweden. The pitch rate and the elevator, leading edge flap, and canard angles
are shown.


The aircraft is unstable in the pitch channel at this flight condition, so clearly the
experiment was carried out under closed loop control. In Section 17.3 we will return
to this example and identify models based on the measured data. □
Example 1.3 Speech


The sound of the human voice is generated by the vibration of the vocal chords or,
in the case of unvoiced sounds, the air stream from the throat, and formed by the
shape of the vocal tract. See Figure 1.7. The output of this system is sound vibration
(i.e., the air pressure), but the external stimuli are not measurable. See Figure 1.8.
Data from this system are shown in Figure 1.9. □


Figure 1.7 Speech generation. [Diagram label: vocal chords.]


The systems in these examples are all dynamic, which means that the current
output value depends not only on the current external stimuli but also on their earlier
values. Outputs of dynamical systems whose external stimuli are not observed (such
as in Example 1.3) are often called time series. This term is especially common in
economic applications. Clearly, the list of examples of dynamical systems can be very
long, encompassing many fields of science.


Figure 1.8 The speech system: y: output; v: unmeasured disturbance. [Block
diagram label: v: chord vibrations, airflow.]

Figure 1.9 The speech signal (air pressure). Data sampled every 0.125 ms (8 kHz
sampling rate).


1.2 MODELS 


Model Types and Their Use 


When we interact with a system, we need some concept of how its variables relate to
each other. With a broad definition, we shall call such an assumed relationship among
observed signals a model of the system. Clearly, models may come in various shapes
and be phrased with varying degrees of mathematical formalism. The intended
use will determine the degree of sophistication that is required to make the model
purposeful.

No doubt, in daily life many systems are dealt with using mental models, which
do not involve any mathematical formalization at all. To drive a car, for example, 
requires the knowledge that turning the steering wheel to the left induces a left turn, 
together with subtle information built up in the muscle memory. The importance 
and degree of sophistication of the latter should of course not be underestimated. 

For certain systems it is appropriate to describe their properties using numerical
tables and/or plots. We shall call such descriptions graphical models. Linear systems,
for example, can be uniquely described by their impulse or step responses or by their
frequency functions. Graphical representations of these are widely used for various
design purposes. The nonlinear characteristics of, say, a valve are also well suited to
be described by a graphical model.

For more advanced applications, it may be necessary to use models that describe
the relationships among the system variables in terms of mathematical expressions
like difference or differential equations. We shall call such models mathematical (or
analytical) models. Mathematical models may be further characterized by a number
of adjectives (time continuous or time discrete, lumped or distributed, deterministic
or stochastic, linear or nonlinear, etc.) signifying the type of difference or differential
equation used. The use of mathematical models is inherent in all fields of engineering
and physics. In fact, a major part of the engineering field deals with how to make good
designs based on mathematical models. They are also instrumental for simulation and
forecasting (prediction), which are extensively used in all fields, including nontechnical
areas like economy, ecology and biology.


The model used in a computer simulation of a system is a program. For complex
systems, this program may be built up by many interconnected subroutines and
lookup tables, and it may not be feasible to summarize it analytically as a mathematical
model. We use the term software model for such computerized descriptions. They
have come to play an increasingly important role in decision making for complicated
systems.


Building Models 


Basically, a model has to be constructed from observed data. The mental model of
car-steering dynamics, for example, is developed through driving experience. Graphical
models are made up from certain measurements. Mathematical models may be
developed along two routes (or a combination of them). One route is to split up the
system, figuratively speaking, into subsystems, whose properties are well understood
from previous experience. This basically means that we rely on earlier empirical
work. These subsystems are then joined mathematically and a model of the whole
system is obtained. This route is known as modeling and does not necessarily involve
any experimentation on the actual system. The procedure of modeling is quite
application dependent and often has its roots in tradition and specific techniques
in the application area in question. Basic techniques typically involve structuring
of the process into block diagrams with blocks consisting of simple elements. The
reconstruction of the system from these simple blocks is now increasingly being done
by computer, resulting in a software model rather than a mathematical model.


The other route to mathematical as well as graphical models is directly based on
experimentation. Input and output signals from the system, such as those in Figures
1.4, 1.6, and 1.9, are recorded and subjected to data analysis in order to infer a model.
This route is system identification.


The Fiction of a True System 


The real-life actual system is an object of a different kind than our mathematical
models. In a sense, there is an impenetrable but transparent screen between our
world of mathematical descriptions and the real world. We can look through this
window and compare certain aspects of the physical system with its mathematical
description, but we can never establish any exact connection between them. The
question of nature's susceptibility to mathematical description has some deep philosophical
aspects, and in practical terms we have to take a more pragmatic view of
models. Our acceptance of models should thus be guided by “usefulness” rather
than “truth.” Nevertheless, we shall occasionally use a concept of “the true system,”
defined in terms of a mathematical description. Such a fiction is helpful for devising
identification methods and understanding their properties. In such contexts we
assume that the observed data have been generated according to some well-defined
mathematical rules, which of course is an idealization.
1.3 AN ARCHETYPICAL PROBLEM: ARX MODELS AND THE LINEAR LEAST SQUARES METHOD


In this section we shall consider a specific estimation problem that contains most of 
the central issues that this book deals with. The section will thus be a preview of the 
book. In the following section we shall comment on the general nature of the issues 
raised here and how they relate to the organization of the book. 


The Model 


We shall generally denote the system's input and output at time t by u(t) and y(t),
respectively. Perhaps the most basic relationship between the input and output is
the linear difference equation:


$$y(t) + a_1 y(t-1) + \cdots + a_n y(t-n) = b_1 u(t-1) + \cdots + b_m u(t-m) \tag{1.1}$$


We have chosen to represent the system in discrete time, primarily since observed data 
are always collected by sampling. It is thus more straightforward to relate observed 
data to discrete time models. In (1.1) we assume the sampling interval to be one time 
unit. This is not essential, but makes notation easier.

A pragmatic and useful way to see (1.1) is to view it as a way of determining
the next output value given previous observations:

$$y(t) = -a_1 y(t-1) - \cdots - a_n y(t-n) + b_1 u(t-1) + \cdots + b_m u(t-m) \tag{1.2}$$

For more compact notation we introduce the vectors:

$$\theta = [a_1 \;\ldots\; a_n \;\; b_1 \;\ldots\; b_m]^T \tag{1.3}$$

$$\varphi(t) = [-y(t-1) \;\ldots\; -y(t-n) \;\; u(t-1) \;\ldots\; u(t-m)]^T \tag{1.4}$$

With these, (1.2) can be rewritten as

$$y(t) = \varphi^T(t)\theta$$

To emphasize that the calculation of y(t) from past data (1.2) indeed depends on the
parameters in $\theta$, we shall rather call this calculated value $\hat{y}(t|\theta)$ and write

$$\hat{y}(t|\theta) = \varphi^T(t)\theta \tag{1.5}$$
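
As an illustration of (1.4) and (1.5), the following is a minimal Python sketch of how the regression vector and the one-step prediction could be formed from recorded samples. The function names `regressor` and `predict`, the zero-padding of samples before the start of the record, and the numerical values are assumptions made for this sketch only.

```python
import numpy as np

def regressor(y, u, t, n, m):
    """Return phi(t) = [-y(t-1) ... -y(t-n)  u(t-1) ... u(t-m)]^T, cf. (1.4).

    y[t] and u[t] hold the samples at time t; samples before index 0
    are taken as zero (an assumption of this sketch).
    """
    def past(x, k):
        return x[t - k] if t - k >= 0 else 0.0
    return np.array([-past(y, k) for k in range(1, n + 1)]
                    + [past(u, k) for k in range(1, m + 1)])

def predict(theta, y, u, t, n, m):
    """One-step prediction yhat(t|theta) = phi(t)^T theta, cf. (1.5)."""
    return float(regressor(y, u, t, n, m) @ theta)

# Hypothetical example with n = 2, m = 1, so theta = [a1, a2, b1].
theta = np.array([-1.5, 0.7, 1.0])
u = np.array([1.0, 0.0, 0.0, 0.0])
y = np.array([0.0, 1.0, 1.5, 0.0])
y_hat_3 = predict(theta, y, u, t=3, n=2, m=1)
```

Here `regressor(y, u, 3, 2, 1)` is [-y(2), -y(1), u(2)], so the prediction combines the two most recent outputs and the most recent input.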


The Least Squares Method 


Now suppose for a given system that we do not know the values of the parameters
in $\theta$, but that we have recorded inputs and outputs over a time interval $1 \le t \le N$:

$$Z^N = \{u(1), y(1), \ldots, u(N), y(N)\} \tag{1.6}$$

An obvious approach is then to select $\theta$ in (1.1) through (1.5) so as to fit the calculated
values $\hat{y}(t|\theta)$ as well as possible to the measured outputs by the least squares method:

$$\min_{\theta} V_N(\theta, Z^N) \tag{1.7}$$
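where

$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N}\left(y(t) - \hat{y}(t|\theta)\right)^2 = \frac{1}{N}\sum_{t=1}^{N}\left(y(t) - \varphi^T(t)\theta\right)^2 \tag{1.8}$$

We shall denote the value of $\theta$ that minimizes (1.7) by $\hat{\theta}_N$:

$$\hat{\theta}_N = \arg\min_{\theta} V_N(\theta, Z^N) \tag{1.9}$$

(“arg min” means the minimizing argument, i.e., that value of $\theta$ which minimizes
$V_N$.)

Since $V_N$ is quadratic in $\theta$, we can find the minimum value easily by setting the
derivative to zero:

$$0 = \frac{d}{d\theta}V_N(\theta, Z^N) = -\frac{2}{N}\sum_{t=1}^{N}\varphi(t)\left(y(t) - \varphi^T(t)\theta\right)$$

which gives

$$\sum_{t=1}^{N}\varphi(t)y(t) = \sum_{t=1}^{N}\varphi(t)\varphi^T(t)\theta \tag{1.10}$$

or

$$\hat{\theta}_N = \left[\sum_{t=1}^{N}\varphi(t)\varphi^T(t)\right]^{-1}\sum_{t=1}^{N}\varphi(t)y(t) \tag{1.11}$$

Once the vectors $\varphi(t)$ are defined, the solution can easily be found by modern numerical
software, such as MATLAB.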
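
The computation in (1.11) can be sketched in a few lines of Python with NumPy, used here in place of MATLAB. The function name `arx_least_squares` and the simulated data are assumptions for this illustration. The sketch also starts the sums at the first time instant with a full regression vector rather than padding with zeros, and it calls `numpy.linalg.lstsq`, which solves the same least squares problem as the normal equations (1.10) without forming the matrix inverse explicitly.

```python
import numpy as np

def arx_least_squares(y, u, n, m):
    """Estimate theta = [a1..an, b1..bm] of the ARX model (1.1), cf. (1.11)."""
    N = len(y)
    start = max(n, m)                     # first t with a full regression vector
    # Each row is phi(t)^T = [-y(t-1) ... -y(t-n)  u(t-1) ... u(t-m)]
    Phi = np.array([
        np.concatenate((-y[t - n:t][::-1], u[t - m:t][::-1]))
        for t in range(start, N)
    ])
    theta_hat, *_ = np.linalg.lstsq(Phi, y[start:], rcond=None)
    return theta_hat

# Hypothetical noise-free first-order system: y(t) + a y(t-1) = b u(t-1)
rng = np.random.default_rng(0)
a_true, b_true = -0.8, 0.5
N = 200
u = rng.standard_normal(N)
y = np.zeros(N)
for t in range(1, N):
    y[t] = -a_true * y[t - 1] + b_true * u[t - 1]

theta_hat = arx_least_squares(y, u, n=1, m=1)
```

Because the simulated data contain no noise, the estimate should reproduce the values of a and b used in the simulation up to rounding.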


Example 1.4 First Order Difference Equation

Consider the simple model:

$$y(t) + a y(t-1) = b u(t-1).$$

This gives us the estimate according to (1.4), (1.3) and (1.11):

$$\begin{bmatrix}\hat{a}_N \\ \hat{b}_N\end{bmatrix} = \begin{bmatrix}\sum y^2(t-1) & -\sum y(t-1)u(t-1) \\ -\sum y(t-1)u(t-1) & \sum u^2(t-1)\end{bmatrix}^{-1}\begin{bmatrix}-\sum y(t)y(t-1) \\ \sum y(t)u(t-1)\end{bmatrix}$$

All sums are from t = 1 to t = N. A typical convention is to take values outside
the measured range to be zero. In this case we would thus take y(0) = 0. □
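
The 2 × 2 expression of Example 1.4 can be evaluated directly. The sketch below is one way to do so in Python; the function name `first_order_estimate`, the test data, and the choice of taking both y(0) and u(0) as zero are assumptions of the sketch.

```python
import numpy as np

def first_order_estimate(y, u):
    """Closed-form LS estimate [a_hat, b_hat] for y(t) + a y(t-1) = b u(t-1).

    y and u hold the samples for t = 1..N; y(0) and u(0) are taken as zero.
    """
    y_prev = np.concatenate(([0.0], y[:-1]))    # y(t-1) for t = 1..N
    u_prev = np.concatenate(([0.0], u[:-1]))    # u(t-1) for t = 1..N
    R = np.array([[np.sum(y_prev ** 2),       -np.sum(y_prev * u_prev)],
                  [-np.sum(y_prev * u_prev),   np.sum(u_prev ** 2)]])
    f = np.array([-np.sum(y * y_prev), np.sum(y * u_prev)])
    return np.linalg.solve(R, f)

# Hypothetical noise-free data generated with a = 0.5, b = 2 and zero initial values
u = np.array([1.0, -1.0, 2.0, 0.5, -0.5, 1.5])
y = np.zeros_like(u)
for k in range(1, len(u)):
    y[k] = -0.5 * y[k - 1] + 2.0 * u[k - 1]

a_hat, b_hat = first_order_estimate(y, u)
```

Since the data satisfy the model exactly, including at the first sample, the closed-form estimate should return the simulated a and b.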
The simple model (1.1) and the well known least squares method (1.11) form the
archetype of System Identification. Not only that: they also give the most commonly
used parametric identification method and are much more versatile than perhaps
perceived at first sight. In particular one should realize that (1.1) can directly be
extended to several different inputs (this just calls for a redefinition of $\varphi(t)$ in (1.4))
and that the inputs and outputs do not have to be the raw measurements. On the
contrary, it is often most important to think over the physics of the application
and come up with suitable inputs and outputs for (1.1), formed from the actual
measurements.


Example 1.5 An Immersion Heater 


Consider a process consisting of an immersion heater immersed in a cooling liquid. 
We measure: 


• u(t): The voltage applied to the heater
• r(t): The temperature of the liquid
• y(t): The temperature of the heater coil surface

Suppose we need a model for how y(t) depends on r(t) and u(t). Some simple considerations
based on common sense and high school physics (“Semi-physical modeling”)
reveal the following:

• The change in temperature of the heater coil over one sample is proportional
to the electrical power in it (the inflow power) minus the heat loss to the liquid

• The electrical power is proportional to $u^2(t)$

• The heat loss is proportional to y(t) − r(t)

This suggests the model:

$$y(t) = y(t-1) + \alpha u^2(t-1) - \beta\left[y(t-1) - r(t-1)\right]$$

which fits into the form

$$y(t) + \theta_1 y(t-1) = \theta_2 u^2(t-1) + \theta_3 r(t-1)$$

This is a two input ($u^2$ and r) and one output model, and corresponds to choosing

$$\varphi(t) = [-y(t-1) \;\; u^2(t-1) \;\; r(t-1)]^T$$

in (1.5). We could also enforce the suggestion from the physics that $\theta_1 - \theta_3 = -1$
(which follows from $\theta_1 = \beta - 1$ and $\theta_3 = \beta$) by another choice of variables.
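
To make the idea of physically motivated regressors concrete, here is a small Python sketch that simulates the heater model and estimates its three parameters by least squares. The function name `heater_regressors` and the values chosen for the constants, the voltage range, and the liquid temperature are hypothetical and serve only as an illustration.

```python
import numpy as np

def heater_regressors(y, u, r):
    """Rows phi(t)^T = [-y(t-1), u(t-1)^2, r(t-1)] for t = 1..N-1 (0-based arrays)."""
    return np.column_stack((-y[:-1], u[:-1] ** 2, r[:-1]))

rng = np.random.default_rng(1)
N = 100
alpha, beta = 0.02, 0.1                      # hypothetical physical constants
u = rng.uniform(0.0, 10.0, N)                # applied voltage
r = 20.0 + rng.standard_normal(N)            # liquid temperature
y = np.empty(N)
y[0] = 20.0
for t in range(1, N):
    # change in coil temperature = inflow power minus heat loss to the liquid
    y[t] = y[t - 1] + alpha * u[t - 1] ** 2 - beta * (y[t - 1] - r[t - 1])

theta_hat, *_ = np.linalg.lstsq(heater_regressors(y, u, r), y[1:], rcond=None)
```

With noise-free data the three estimates should equal β − 1, α, and β, respectively. Note that the squared voltage, not the raw voltage, enters the regression vector, so the model remains linear in the parameters.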


Linear Regressions 


Model structures such as (1.5) that are linear in $\theta$ are known in statistics as linear
regressions. The vector $\varphi(t)$ is called the regression vector, and its components
are the regressors. “Regress” here alludes to the fact that we try to calculate (or
describe) y(t) by “going back” to $\varphi(t)$. Models such as (1.1), where the regression
vector $\varphi(t)$ contains old values of the variable to be explained, y(t), are then
partly auto-regressions. For that reason the model structure (1.1) has the standard
name ARX-model: Auto-Regression with eXtra inputs (or eXogenous variables).


There is a rich statistical literature on the properties of the estimate $\hat{\theta}_N$ under
varying assumptions. We shall deal with such questions in a much more general
setting in Chapters 7 to 9, and the following section can be seen as a preview of the
issues dealt with there. See also Appendix II.


Model Quality and Experiment Design 


Let us consider the simplest special case, that of a Finite Impulse Response (FIR)
model. That is obtained from (1.1) by taking n = 0:

$$y(t) = b_1 u(t-1) + \cdots + b_m u(t-m) \tag{1.12}$$

Suppose that the observed data really have been generated by a similar mechanism

$$y(t) = b_1^0 u(t-1) + \cdots + b_m^0 u(t-m) + e(t) \tag{1.13}$$

where e(t) is a white noise sequence with variance $\lambda$, but otherwise unknown. (That
is, e(t) can be described as a sequence of independent random variables with zero
mean values and variances $\lambda$.) Analogous to (1.5), we can write this as

$$y(t) = \varphi^T(t)\theta_0 + e(t) \tag{1.14}$$

where $\theta_0$ collects the true coefficients $b_1^0, \ldots, b_m^0$. The input sequence u(t), t = 1, 2, ...
is taken as a given, deterministic, sequence of
numbers. We can now replace y(t) in (1.11) by the above expression, and obtain

$$\hat{\theta}_N = R(N)^{-1}\sum_{t=1}^{N}\varphi(t)y(t) = R(N)^{-1}\left[\sum_{t=1}^{N}\varphi(t)\varphi^T(t)\theta_0 + \sum_{t=1}^{N}\varphi(t)e(t)\right], \qquad R(N) = \sum_{t=1}^{N}\varphi(t)\varphi^T(t)$$

or

$$\tilde{\theta}_N = \hat{\theta}_N - \theta_0 = R(N)^{-1}\sum_{t=1}^{N}\varphi(t)e(t) \tag{1.15}$$


f=] 


Since u and hence $\varphi$ are given, deterministic variables, R(N) is a deterministic
matrix. If E denotes mathematical expectation, we therefore have

$$E\tilde{\theta}_N = E\left[R(N)^{-1}\sum_{t=1}^{N}\varphi(t)e(t)\right] = R(N)^{-1}\sum_{t=1}^{N}\varphi(t)Ee(t) = 0$$

since e(t) has zero mean. The estimate is consequently unbiased.
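We can also form the expectation of $\tilde{\theta}_N\tilde{\theta}_N^T$, i.e., the covariance matrix of the
parameter error

$$\begin{aligned}
P_N = E\tilde{\theta}_N\tilde{\theta}_N^T &= E\,R(N)^{-1}\sum_{t,s=1}^{N}\varphi(t)e(t)e(s)\varphi^T(s)R(N)^{-1} \\
&= R(N)^{-1}\sum_{t,s=1}^{N}\varphi(t)\varphi^T(s)R(N)^{-1}Ee(t)e(s) \\
&= R(N)^{-1}\sum_{t,s=1}^{N}\varphi(t)\varphi^T(s)R(N)^{-1}\lambda\delta(t-s) \\
&= R(N)^{-1}\lambda\sum_{t=1}^{N}\varphi(t)\varphi^T(t)R(N)^{-1} = \lambda R(N)^{-1}
\end{aligned} \tag{1.16}$$

where we used the fact that e is a sequence of independent variables so that
$Ee(t)e(s) = \lambda\delta(t-s)$, with $\delta(0) = 1$ and $\delta(\tau) = 0$ if $\tau \neq 0$.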
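
The two results above, unbiasedness and the covariance expression (1.16), can be checked numerically. The following Monte Carlo sketch in Python keeps one input sequence fixed, draws many independent noise realizations, and compares the sample mean and sample covariance of the parameter error with the theoretical values. The model order, true parameters, noise variance, and number of runs are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, m, lam = 50, 2, 0.25
theta0 = np.array([1.0, -0.5])              # hypothetical true b_1, b_2
u = rng.standard_normal(N + m)              # one fixed (deterministic) input record

# Rows phi(t)^T = [u(t-1), u(t-2)] for N consecutive time instants
Phi = np.column_stack([u[m - k:N + m - k] for k in range(1, m + 1)])
R = Phi.T @ Phi                             # R(N)
P_theory = lam * np.linalg.inv(R)           # right-hand side of (1.16)

runs = 20000
E = np.sqrt(lam) * rng.standard_normal((runs, N))   # white noise, variance lam
Y = Phi @ theta0 + E                        # each row is one realization of y, cf. (1.14)
Theta = np.linalg.solve(R, Phi.T @ Y.T).T   # LS estimate (1.11) for every run
err = Theta - theta0                        # parameter errors, cf. (1.15)

bias = err.mean(axis=0)                     # should be close to zero
P_mc = err.T @ err / runs                   # should be close to P_theory
```

The agreement is only approximate for a finite number of runs, so the comparison should be made with a tolerance that reflects the Monte Carlo error.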

We have thus computed the covariance matrix of the estimate $\hat{\theta}_N$. It is determined
entirely by the input properties R(N) and the noise level $\lambda$. Moreover,
define

$$\bar{R} = \lim_{N\to\infty}\frac{1}{N}R(N) \tag{1.17}$$

This will correspond to the covariance matrix of the input, i.e., the (i, j) element of
$\bar{R}$ is

$$\lim_{N\to\infty}\frac{1}{N}\sum_{t=1}^{N}u(t-i)u(t-j)$$

If the matrix $\bar{R}$ is non-singular, we find that the covariance matrix of the parameter
estimate is approximately given by

$$P_N = \frac{\lambda}{N}\bar{R}^{-1} \tag{1.18}$$

and the approximation improves as $N \to \infty$. A number of things follow from this.

• The covariance decays like 1/N, so the parameters approach the limiting value
at the rate $1/\sqrt{N}$.

• The covariance is proportional to the Noise-To-Signal ratio. That is, it is proportional
to the noise variance $\lambda$ and inversely proportional to the input power.

• The covariance does not depend on the input's or noise's signal shapes, only on
their variance/covariance properties.

• Experiment design, i.e., the selection of the input u, aims at making the matrix
$\bar{R}^{-1}$ “as small as possible.” Note that the same $\bar{R}$ can be obtained for many
different signals u.
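
The effect of the input on the covariance can be illustrated by comparing two inputs of equal power. The Python sketch below computes the inverse of R(N)/N for a second-order FIR model, once for a random binary input and once for a slowly switching square wave. The function name `cov_factor`, the two signals, and the use of the trace as a scalar measure of size are choices made for this sketch, not prescriptions from the text.

```python
import numpy as np

def cov_factor(u, m):
    """Return the inverse of R(N)/N for an FIR model of order m, cf. (1.17)-(1.18)."""
    N = len(u) - m
    Phi = np.column_stack([u[m - k:m - k + N] for k in range(1, m + 1)])
    return np.linalg.inv(Phi.T @ Phi / N)

rng = np.random.default_rng(3)
N = 2000
u_white = rng.choice([-1.0, 1.0], size=N)                      # random binary, power 1
u_slow = np.where((np.arange(N) // 10) % 2 == 0, 1.0, -1.0)    # square wave, power 1

trace_white = np.trace(cov_factor(u_white, m=2))
trace_slow = np.trace(cov_factor(u_slow, m=2))
```

Both signals have unit power, but consecutive samples of the square wave are strongly correlated, which makes the matrix closer to singular and its inverse larger. For the same noise level and record length, the random binary input would therefore be expected to give the smaller parameter covariance in this example.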
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1.4 THE SYSTEM IDENTIFICATION PROCEDURE 


Three Basic Entities 
As we Saw in the previous section, the construction of a model from data involves 
three basic entities: 


1. 
2. 
3. 


A data set. like Z“ in (1.6). 
A set of candidate models: a Model Structure. like the set (1.1) or (1.5). 


A rule by which candidate models can be assessed using the data. like the Least 
Squares selection rule (1.9). 


Let us comment on each of these: 


1. The data record. The input-output data are sometimes recorded during a
specifically designed identification experiment, where the user may determine
which signals to measure and when to measure them and may also choose the
input signals. The objective with experiment design is thus to make these choices
so that the data become maximally informative, subject to constraints that may
be at hand. Making the matrix $R^{-1}$ in (1.18) small is a typical example of this,
and we shall in Chapter 13 treat this question in more detail. In other cases the
user may not have the possibility to affect the experiment, but must use data
from the normal operation of the system.

2. The set of models or the model structure. A set of candidate models is obtained
by specifying within which collection of models we are going to look for a
suitable one. This is no doubt the most important and, at the same time, the
most difficult choice of the system identification procedure. It is here that a
priori knowledge and engineering intuition and insight have to be combined
with formal properties of models. Sometimes the model set is obtained after
careful modeling. Then a model with some unknown physical parameters is
constructed from basic physical laws and other well-established relationships.
In other cases standard linear models may be employed, without reference to
the physical background. Such a model set, whose parameters are basically
viewed as vehicles for adjusting the fit to the data and do not reflect physical
considerations in the system, is called a black box. Model sets with adjustable
parameters with physical interpretation may, accordingly, be called gray boxes.
Generally speaking, a model structure is a parameterized mapping from past
inputs and outputs $Z^{t-1}$ (cf. (1.6)) to the space of the model outputs:

$$ \hat{y}(t|\theta) = g(\theta, Z^{t-1}) $$   (1.19)

Here $\theta$ is the finite dimensional vector used to parameterize the mapping.
Chapters 4 and 5 describe common model structures.

3. Determining the "best" model in the set, guided by the data. This is the
identification method. The assessment of model quality is typically based on how
the models perform when they attempt to reproduce the measured data. The
basic approaches to this will be dealt with independently of the model structure
used. Chapter 7 treats this problem.
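
The three entities can be illustrated by a small Python sketch. It is only an illustration under stated assumptions: the first-order structure y(t) + a y(t-1) = b u(t-1) + e(t), the numerical values, and all function names are hypothetical choices made for this example and are not taken from the text.

```python
import random


def simulate(a, b, n, noise_std, seed=0):
    """Entity 1, the data set: input-output data from an assumed 'true' system."""
    rng = random.Random(seed)
    u = [rng.choice([-1.0, 1.0]) for _ in range(n)]  # random binary input
    y = [0.0] * n
    for t in range(1, n):
        y[t] = -a * y[t - 1] + b * u[t - 1] + rng.gauss(0.0, noise_std)
    return u, y


def predict(theta, y_prev, u_prev):
    """Entity 2, the model structure: a predictor yhat(t|theta) = g(theta, Z^{t-1})."""
    a, b = theta
    return -a * y_prev + b * u_prev


def least_squares(u, y):
    """Entity 3, the selection rule: minimize the sum of squared prediction errors."""
    # The predictor is linear in theta with regressor phi(t) = [-y(t-1), u(t-1)],
    # so the minimizer solves a 2x2 system of normal equations.
    s11 = s12 = s22 = r1 = r2 = 0.0
    for t in range(1, len(y)):
        p1, p2 = -y[t - 1], u[t - 1]
        s11 += p1 * p1
        s12 += p1 * p2
        s22 += p2 * p2
        r1 += p1 * y[t]
        r2 += p2 * y[t]
    det = s11 * s22 - s12 * s12
    return ((s22 * r1 - s12 * r2) / det, (s11 * r2 - s12 * r1) / det)


u, y = simulate(a=-0.7, b=1.0, n=2000, noise_std=0.1)
theta_hat = least_squares(u, y)
# Mean squared one-step prediction error of the selected model
loss = sum((y[t] - predict(theta_hat, y[t - 1], u[t - 1])) ** 2
           for t in range(1, len(y))) / (len(y) - 1)
```

With this simulated data record the estimate should land close to the assumed values (a, b) = (-0.7, 1.0), and the loss should be close to the assumed noise variance.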
Model Validation


After having settled on the preceding three choices, we have, at least implicitly,
arrived at a particular model: the one in the set that best describes the data according
to the chosen criterion. It then remains to test whether this model is "good enough,"
that is, whether it is valid for its purpose. Such tests are known as model validation.
They involve various procedures to assess how the model relates to observed data, to
prior knowledge, and to its intended use. Deficient model behavior in these respects
makes us reject the model, while good performance will develop a certain confidence
in it. A model can never be accepted as a final and true description of the system.
Rather, it can at best be regarded as a good enough description of certain aspects that
are of particular interest to us. Chapter 16 contains a discussion of model validation.


The System Identification Loop 


The system identification procedure has a natural logical flow: first collect data, then
choose a model set, then pick the "best" model in this set. It is quite likely, though,
that the model first obtained will not pass the model validation tests. We must then
go back and revise the various steps of the procedure.

The model may be deficient for a variety of reasons: 


• The numerical procedure failed to find the best model according to our criterion.

• The criterion was not well chosen.

• The model set was not appropriate, in that it did not contain any "good enough"
description of the system.

• The data set was not informative enough to provide guidance in selecting good
models.


The major part of an identification application in fact consists of addressing these 
problems, in particular the third one, in an iterative manner, guided by prior infor- 
mation and the outcomes of previous attempts. See Figure 1.10. Interactive software 
obviously is an important tool for handling the iterative character of this problem. 


1.5 ORGANIZATION OF THE BOOK 


To master the loop of Figure 1.10, the user has to be familiar with a number of things. 


1. Available techniques of identification and their rationale, as well as typical 
choices of model sets. 


2. The properties of the identified model and their dependence on the basic items: 
data, model set, and identification criterion. 


3. Numerical schemes for computing the estimate. 


4. How to make intelligent choices of experiment design, model set, and identifi- 
cation criterion, guided by prior information as well as by observed data. 
Figure 1.10 The system identification loop.


In fact, a user of system identification may find that he or she is primarily a user
of an interactive software package. Items 1 and 3 are then part of the package and 
the important thing is to have a good understanding of item 2 so that task 4 can be 
successfully completed. This is what we mean by “Theory for the User” and this is 
where the present book has its focus. 

The idea behind the book's organization is to present the list of common and
useful model sets in Chapters 4 and 5. Available techniques are presented in Chapters
6 and 7, and the analysis follows in Chapters 8 and 9. Numerical techniques for off- 
line and on-line applications are described in Chapters 10 and 11. Task 4, the user’s 
choices, is discussed primarily in Chapters 13 through 16, after some preliminaries 
in Chapter 12. In addition, Chapters 2 and 3 give the formal setup of the book, 
and Chapter 17 describes and assesses system identification as a tool for practical 
problems. 

Figure 1.11 illustrates the book’s structure in relation to the loop of system 
identification. 
Figure 1.11 Organization of the book.


About the Framework 


The system identification framework we set up here is fairly general. It does not 
confine us to linear models or quadratic criteria or to assuming that the system itself 
can be described within the model set. Indeed, this is one of the points that should 
be stressed about our framework. Nevertheless, we often give proofs and explicit
expressions only for certain special cases, like single-input, single-output systems and
quadratic criteria. The purpose is of course to enhance the underlying basic ideas
and not conceal them behind technical details. References are usually provided for
more general treatments.

Parameter estimation and identification are usually described within a proba- 
bilistic framework. Here we basically employ such a framework. However, we also 
try to retain a pragmatic viewpoint that is independent of probabilistic
interpretations. That is, the methods we describe and the recommendations we put forward
should make sense even without the probabilistic framework that may motivate them 
as “optimal solutions.” The probabilistic and statistical environments of the book 
are described in Appendices I and II, respectively. These appendices may be read 
prior to the other chapters or consulted occasionally when required. In any case, the 
book does not lean heavily on the background provided there. 


1.6 BIBLIOGRAPHY 


The literature on the system identification problem and its ramifications is
extensive. Among general textbooks on the subject we may mention Box and Jenkins
(1970), Eykhoff (1974), Spriet and Vansteenkiste (1982), Ljung and Glad (1994a),
and Johansson (1993) for treatments covering several practical issues, while
Goodwin and Payne (1977), Davis and Vinter (1985), Hannan and Deistler (1988), Caines
(1988), Chen and Guo (1991), and Söderström and Stoica (1989) give more
theoretically oriented presentations. Kashyap and Rao (1976), Rissanen (1989), and Bohlin
(1991) emphasize the role of model validation and model selection in their treatment
of system identification, while Söderström and Stoica (1983) focus on
instrumental-variable methods. A treatment based on frequency domain data is given in Schoukens
and Pintelon (1991), and the so-called subspace approach is thoroughly discussed in
Van Overschee and DeMoor (1996). Texts that concentrate on recursive
identification techniques include Ljung and Söderström (1983), Solo and Kong (1995),
Haykin (1986), Widrow and Stearns (1985), and Young (1984). Spectral analysis is
closely related, and treated in many books like Marple (1987), Kay (1988), and Stoica
and Moses (1997). Statistical treatments of time-series modeling such as Anderson
(1971), Hannan (1970), Brillinger (1981), and Wei (1990) are most relevant also for
the system identification problem. The so-called behavioral approach to modeling
is introduced in Willems (1987).

Among edited collections of articles, we may refer to Mehra and Lainiotis
(1976), Eykhoff (1981), Hannan, Krishnaiah, and Rao (1985), and Leondes (1987),
as well as to the special journal issues Kailath, Mehra, and Mayne (1974), Isermann
(1981), Eykhoff and Parks (1990), Kosut, Goodwin, and Polis (1992), and Söderström
and Åström (1995). The proceedings from the IFAC (International Federation of
Automatic Control) Symposia on Identification and System Parameter Estimation
contain many articles on all aspects of the system identification problem. These
symposia are held every three years, starting in Prague 1967.

Philosophical aspects on mathematical models of real-life objects are discussed,
for example, in Popper (1934). Modeling from basic physical laws, rather than from
data, is discussed in many books; see, for example, Wellstead (1979), Ljung and Glad
(1994a), Frederick and Close (1978), and Cellier (1990) for engineering applications.
Such treatments are important complements to the model set selection (see Section
1.3 and Chapter 16).

Many books discuss modeling and identification in various application areas.
See, for example, Granger and Newbold (1977) or Malinvaud (1980) (econometrics),
Godfrey (1983) (biology), Robinson and Treitel (1980) and Mendel (1983) (geoscience),
Dudley (1983) (electromagnetic wave theory), Markel and Gray (1976) (speech
signals), and Beck and Van Straten (1983) (environmental systems). Rajbman (1976,
1981) has surveyed the Soviet literature.


I SYSTEMS AND MODELS


2 TIME-INVARIANT LINEAR SYSTEMS


Time-invariant linear systems no doubt form the most important class of dynamical
systems considered in practice and in the literature. It is true that they represent
idealizations of the processes encountered in real life. But, even so, the approximations
involved are often justified, and design considerations based on linear theory lead to
good results in many cases.

A treatise of linear systems theory is a standard ingredient in basic engineering
education, and the reader has no doubt some knowledge of this topic. Anyway, in this
chapter we shall provide a refresher on some basic concepts that will be instrumental 
for the further development in this book. In Section 2.1 we shall discuss the impulse 
response and various ways of describing and understanding disturbances, as well as 
introduce the transfer-function concept. In Section 2.2 we study frequency-domain 
interpretations and also introduce the periodogram. Section 2.3 gives a unified setup 
of spectra of deterministic and stochastic signals that will be used in the remainder 
of this book. In Section 2.4 a basic ergodicity result is proved. The development 
in these sections is for systems with a scalar input and a scalar output. Section 2.5 
contains the corresponding expressions for multivariable systems.


2.1 IMPULSE RESPONSES, DISTURBANCES, AND TRANSFER 
FUNCTIONS 


Impulse Response 


Consider a system with a scalar input signal $u(t)$ and a scalar output signal $y(t)$
(Figure 2.1). The system is said to be time invariant if its response to a certain input
signal does not depend on absolute time. It is said to be linear if its output response to
a linear combination of inputs is the same linear combination of the output responses
of the individual inputs. Furthermore, it is said to be causal if the output at a certain
time depends on the input up to that time only.
Figure 2.1 The system.


It is well known that a linear, time-invariant, causal system can be described by
its impulse response (or weighting function) $g(\tau)$ as follows:

$$ y(t) = \int_{\tau=0}^{\infty} g(\tau)\,u(t-\tau)\,d\tau $$   (2.1)


Knowing $\{g(\tau)\}_{\tau=0}^{\infty}$ and knowing $u(s)$ for $s \le t$, we can consequently compute the
corresponding output $y(s)$, $s \le t$ for any input. The impulse response is thus a
complete characterization of the system.


Sampling 


In this book we shall almost exclusively deal with observations of inputs and outputs
in discrete time, since this is the typical data-acquisition mode. We thus assume $y(t)$
to be observed at the sampling instants $t_k = kT$, $k = 1, 2, \ldots$:

$$ y(kT) = \int_{\tau=0}^{\infty} g(\tau)\,u(kT-\tau)\,d\tau $$   (2.2)


The interval $T$ will be called the sampling interval. It is, of course, also possible to
consider the situation where the sampling instants are not equally spread.

Most often, in computer control applications, the input signal $u(t)$ is kept
constant between the sampling instants:

$$ u(t) = u_k, \qquad kT \le t < (k+1)T $$   (2.3)
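This is mostly done for practical implementation reasons, but it will also greatly
simplify the analysis of the system. Inserting (2.3) into (2.2) gives

$$ \begin{aligned}
y(kT) &= \int_{\tau=0}^{\infty} g(\tau)\,u(kT-\tau)\,d\tau
       = \sum_{\ell=1}^{\infty} \int_{\tau=(\ell-1)T}^{\ell T} g(\tau)\,u(kT-\tau)\,d\tau \\
      &= \sum_{\ell=1}^{\infty} \left[ \int_{\tau=(\ell-1)T}^{\ell T} g(\tau)\,d\tau \right] u_{k-\ell}
       = \sum_{\ell=1}^{\infty} g_T(\ell)\,u_{k-\ell}
\end{aligned} $$   (2.4)

where we defined

$$ g_T(\ell) = \int_{\tau=(\ell-1)T}^{\ell T} g(\tau)\,d\tau $$   (2.5)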
The expression (2.4) tells us what the output will be at the sampling instants. Note
that no approximation is involved if the input is subject to (2.3) and that it is sufficient
to know the sequence $\{g_T(\ell)\}_{\ell=1}^{\infty}$ in order to compute the response to the input. The
relationship (2.4) describes a sampled-data system, and we shall call the sequence
$\{g_T(\ell)\}_{\ell=1}^{\infty}$ the impulse response of that system.

Even if the input is not piecewise constant and subject to (2.3), the
representation (2.4) might still be a reasonable approximation, provided $u(t)$ does not change
too much during a sampling interval. See also the following expressions (2.21) to
(2.26). Intersample behavior is further discussed in Section 13.3.
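
The exactness of (2.4) for piecewise-constant inputs can be checked numerically. The Python sketch below assumes, purely for illustration, the impulse response g(τ) = exp(-aτ), which corresponds to the differential equation dy/dt = -a y + u; the values of a and T, the input sequence, and the function names are hypothetical. The sampled impulse response (2.5) is then available in closed form, and the convolution (2.4) is compared with the exact solution of the differential equation stepped from one sampling instant to the next.

```python
import math


def g_T(ell, a, T):
    """Sampled impulse response (2.5) for the assumed g(tau) = exp(-a * tau)."""
    return (math.exp(-a * (ell - 1) * T) - math.exp(-a * ell * T)) / a


def output_by_convolution(u, a, T):
    """y(kT) from (2.4), taking u_k = 0 for k < 0 (system initially at rest)."""
    return [sum(g_T(ell, a, T) * u[k - ell] for ell in range(1, k + 1))
            for k in range(len(u))]


def output_by_recursion(u, a, T):
    """Exact solution of dy/dt = -a*y + u at t = kT for piecewise-constant u."""
    phi = math.exp(-a * T)
    y = [0.0]
    for k in range(len(u) - 1):
        # Over one sampling interval the input is the constant u_k.
        y.append(phi * y[-1] + (1.0 - phi) / a * u[k])
    return y


u = [1.0, 1.0, -1.0, 0.5, 0.0, 2.0, -1.0, 1.0]
y_conv = output_by_convolution(u, a=0.8, T=0.5)
y_rec = output_by_recursion(u, a=0.8, T=0.5)
max_diff = max(abs(p - q) for p, q in zip(y_conv, y_rec))
```

The two outputs agree to rounding error, which is the sense in which no approximation is involved when the input obeys (2.3).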

We shall stick to the notation (2.3) to (2.5) when the choice and size of $T$
are essential to the discussion. For most of the time, however, we shall for ease of
notation assume that $T$ is one time unit and use $t$ to enumerate the sampling instants.
We thus write for (2.4)

$$ y(t) = \sum_{k=1}^{\infty} g(k)\,u(t-k), \qquad t = 0, 1, 2, \ldots $$   (2.6)


For sequences, we shall also use the notation

$$ y_s^t = (y(s), y(s+1), \ldots, y(t)) $$   (2.7)

and for simplicity

$$ y^t = y_1^t $$


Disturbances 


According to the relationship (2.6), the output can be exactly calculated once the
input is known. In most cases this is unrealistic. There are always signals beyond
our control that also affect the system. Within our linear framework we assume that
such effects can be lumped into an additive term $v(t)$ at the output (see Figure 2.2):

$$ y(t) = \sum_{k=1}^{\infty} g(k)\,u(t-k) + v(t) $$   (2.8)


Figure 2.2 System with disturbance.
There are many sources and causes for such a disturbance term. We could list:

• Measurement noise: The sensors that measure the signals are subject to noise
and drift.

• Uncontrollable inputs: The system is subject to signals that have the character
of inputs, but are not controllable by the user. Think of an airplane, whose
movements are affected by the inputs of rudder and aileron deflections, but
also by wind gusts and turbulence. Another example could be a room, where
the temperature is determined by radiators, whose effect we control, but also
by people (≈ 100 W per person) who may move in and out in an unpredictable
manner.


The character of the disturbances could also vary within wide ranges. Classical ways
of describing disturbances in control have been to study steps, pulses, and sinusoids,
while in stochastic control the disturbances are modeled as realizations of stochastic
processes. See Figures 2.3 and 2.4 for some typical, but mutually quite different,
disturbance characteristics. The disturbances may in some cases be separately
measurable, but in the typical situation they are noticeable only via their effect on the
output. If the impulse response of the system is known, then of course the actual
value of the disturbance $v(t)$ can be calculated from (2.8) at time $t$.


Figure 2.3 Room temperature.


The assumption of Figure 2.2 that the noise enters additively to the output 
implies some restrictions. Sometimes the measurements of the inputs to the system 
may also be noise corrupted (“error-in-variable” descriptions). In such cases we 
take a pragmatic approach and regard the measured input values as the actual inputs 
u(t) to the process, and their deviations from the true stimuli will be propagated 
through the system and lumped into the disturbance $v(t)$ of Figure 2.2.
Figure 2.4 Moisture content in paper during paper-making.


Characterization of Disturbances 


The most characteristic feature of a disturbance is that its value is not known be- 
forehand. Information about past disturbances could, however, be important for 
making qualified guesses about future values. It is thus natural to employ a prob- 
abilistic framework to describe future disturbances. We then put ourselves at time 
$t$ and would like to make a statement about disturbances at times $t + k$, $k \ge 1$.
A complete characterization would be to describe the conditional joint probability
density function for $\{v(t+k),\ k \ge 1\}$, given $\{v(s),\ s \le t\}$. This would, however, in
most cases be too laborious, and we shall instead use a simpler approach. 
Let $v(t)$ be given as

$$ v(t) = \sum_{k=0}^{\infty} h(k)\,e(t-k) $$   (2.9)


where $\{e(t)\}$ is white noise, i.e., a sequence of independent (identically distributed)
random variables with a certain probability density function. Although this
description does not allow completely general characterizations of all possible probabilistic
disturbances, it is versatile enough for most practical purposes. In Section 3.2 we
shall show how the description (2.9) allows predictions and probabilistic statements
about future disturbances. For normalization reasons, we shall usually assume that
$h(0) = 1$, which is no loss of generality since the variance of $e$ can be adjusted.

It should be made clear that the specification of different probability density
functions (PDF) for $\{e(t)\}$ may result in very different characteristic features of the
disturbance. For example, the PDF

$$ \begin{aligned}
e(t) &= 0, && \text{with probability } 1-\mu \\
e(t) &= r, && \text{with probability } \mu
\end{aligned} $$   (2.10)
Figure 2.5 A realization of the process (2.9) with e subject to (2.10).


where $r$ is a normally distributed random variable, $r \in N(0, \gamma)$, leads to, if $\mu$ is a
small number, disturbance sequences with characteristic and "deterministic" profiles
occurring at random instants. See Figure 2.5. This could be suitable to describe
"classical" disturbance patterns, steps, pulses, sinusoids, and ramps (cf. Figure 2.3!).
On the other hand, the PDF 


$$ e(t) \in N(0, \lambda) $$   (2.11)


gives a totally different picture. See Figure 2.6. Such a pattern is more suited to 
describe measurement noise and irregular and frequent disturbance actions.

Often we only specify the second-order properties of the sequence $\{e(t)\}$, that
is, the mean and the variances. Note that (2.10) and (2.11) can both be described as
"a sequence of independent random variables with zero mean values and variances
$\lambda$" [$\lambda = \mu\gamma$ for (2.10)], despite the difference in appearance.
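
The point that (2.10) and (2.11) share second-order properties while looking entirely different can be illustrated by simulation. In the Python sketch below, the values of μ and γ, the sample size, the seed, and the function names are assumptions chosen for the example; λ is set to μγ so that both sequences have the same variance.

```python
import random


def impulsive_noise(n, mu, gamma, rng):
    """e(t) as in (2.10): zero with probability 1 - mu, N(0, gamma) with probability mu."""
    return [rng.gauss(0.0, gamma ** 0.5) if rng.random() < mu else 0.0
            for _ in range(n)]


def gaussian_noise(n, lam, rng):
    """e(t) as in (2.11): normally distributed with zero mean and variance lam."""
    return [rng.gauss(0.0, lam ** 0.5) for _ in range(n)]


def sample_variance(x):
    mean = sum(x) / len(x)
    return sum((xi - mean) ** 2 for xi in x) / len(x)


rng = random.Random(1)
mu, gamma = 0.05, 20.0
lam = mu * gamma  # matching variance, lam = mu * gamma

e_imp = impulsive_noise(200000, mu, gamma, rng)
e_gau = gaussian_noise(200000, lam, rng)

var_imp = sample_variance(e_imp)
var_gau = sample_variance(e_gau)
# Fraction of exactly-zero samples: large for (2.10), essentially nil for (2.11)
frac_zero_imp = sum(1 for x in e_imp if x == 0.0) / len(e_imp)
```

Both sample variances come out near λ, yet about 95% of the samples of the first sequence are exactly zero, which is what produces the isolated, pulse-like profiles of Figure 2.5.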

Remark. Notice that $\{e(t)\}$ and $\{v(t)\}$ as defined previously are stochastic
processes (i.e., sequences of random variables). The disturbances that we observe
and that are added to the system output as in Figure 2.2 are thus realizations of 
the stochastic process {v(t)}. Strictly speaking, one should distinguish in notation 
between the process and its realization, but the meaning is usually clear from the 
context, and we do not here adopt this extra notational burden. Often one has oc- 
casion to study signals that are mixtures of deterministic and stochastic components. 
A framework for this will be discussed in Section 2.3. 


Covariance Function 


In the sequel, we shall assume that $e(t)$ has zero mean and variance $\lambda$. With the
description (2.9) of $v(t)$, we can compute the mean as

$$ E\,v(t) = \sum_{k=0}^{\infty} h(k)\,E\,e(t-k) = 0 $$   (2.12)
Figure 2.6 A realization of the same process (2.9) as in Figure 2.5, but with e
subject to (2.11).


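and the covariance as

$$ \begin{aligned}
E\,v(t)v(t-\tau) &= \sum_{k=0}^{\infty}\sum_{s=0}^{\infty} h(k)h(s)\,E\,e(t-k)e(t-\tau-s) \\
                 &= \sum_{k=0}^{\infty}\sum_{s=0}^{\infty} h(k)h(s)\,\delta(k-\tau-s)\,\lambda \\
                 &= \lambda \sum_{k=0}^{\infty} h(k)h(k-\tau)
\end{aligned} $$   (2.13)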


Here $h(r) = 0$ if $r < 0$. We note that this covariance is independent of $t$ and call

$$ R_v(\tau) = E\,v(t)v(t-\tau) $$   (2.14)

the covariance function of the process $v$. This function, together with the mean,
specifies the second-order properties of $v$. These are consequently uniquely defined
by the sequence $\{h(k)\}$ and the variance $\lambda$ of $e$. Since (2.14) and $E\,v(t)$ do not
depend on $t$, the process is said to be stationary.
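
As a small worked instance of (2.13), the following Python sketch evaluates the covariance function for a finite filter. The filter coefficients, the value of λ, and the function name are hypothetical and chosen only so that the numbers are easy to verify by hand.

```python
def covariance_function(h, lam, tau):
    """R_v(tau) = lam * sum_k h(k) h(k - tau), with h(r) = 0 for r < 0, as in (2.13)."""
    tau = abs(tau)  # for a real stationary process R_v(-tau) = R_v(tau)
    return lam * sum(h[k] * h[k - tau] for k in range(tau, len(h)))


h = [1.0, 0.5, 0.25]  # assumed filter with h(0) = 1 and h(k) = 0 for k > 2
lam = 2.0             # assumed variance of e

R = [covariance_function(h, lam, tau) for tau in range(4)]
# By hand: R_v(0) = 2 * (1 + 0.25 + 0.0625), R_v(1) = 2 * (0.5 + 0.125),
#          R_v(2) = 2 * 0.25, and R_v(3) = 0 because the filter has only three taps.
```

For a filter with finitely many nonzero coefficients the covariance function vanishes beyond the filter length, as the last value shows.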


Transfer Functions 


It will be convenient to introduce a shorthand notation for sums like (2.8) and (2.9), 
which will occur frequently in this book. We introduce the forward shift operator $q$
by

$$ q\,u(t) = u(t+1) $$

and the backward shift operator $q^{-1}$:

$$ q^{-1}u(t) = u(t-1) $$


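We can then write for (2.6)

$$ \begin{aligned}
y(t) &= \sum_{k=1}^{\infty} g(k)\,u(t-k) = \sum_{k=1}^{\infty} g(k)\left(q^{-k}u(t)\right) \\
     &= \left[\sum_{k=1}^{\infty} g(k)\,q^{-k}\right] u(t) = G(q)\,u(t)
\end{aligned} $$   (2.15)

where we introduced the notation

$$ G(q) = \sum_{k=1}^{\infty} g(k)\,q^{-k} $$   (2.16)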

We shall call $G(q)$ the transfer operator or the transfer function of the linear system
(2.6). Notice that (2.15) thus describes a relation between the sequences $u^t$ and $y^t$.


Remark. We choose $q$ as argument of $G$ rather than $q^{-1}$ (which perhaps
would be more natural in view of the right side) in order to be in formal agreement
with z-transform and Fourier-transform expressions. Strictly speaking, the term
transfer function should be reserved for the z-transform of $\{g(k)\}_1^{\infty}$, that is,

$$ G(z) = \sum_{k=1}^{\infty} g(k)\,z^{-k} $$   (2.17)

but we shall sometimes not observe that point. □

Similarly with

$$ H(q) = \sum_{k=0}^{\infty} h(k)\,q^{-k} $$   (2.18)

we can write

$$ v(t) = H(q)\,e(t) $$   (2.19)


for (2.9). Our basic description for a linear system with additive disturbance will thus
be

$$ y(t) = G(q)\,u(t) + H(q)\,e(t) $$   (2.20)

with $\{e(t)\}$ as a sequence of independent random variables with zero mean values
and variances $\lambda$.




Continuous-time Representation and Sampling Transfer Functions (*)


For many physical systems it is natural to work with a continuous-time representation
(2.1), since most basic relationships are expressed in terms of differential equations.
With $G_c(s)$ denoting the Laplace transform of the impulse response function $\{g(\tau)\}$
in (2.1), we then have the relationship

$$ Y(s) = G_c(s)\,U(s) $$   (2.21)


between $Y(s)$ and $U(s)$, the Laplace transforms of the output and input, respectively.
Introducing $p$ as the differentiation operator, we could then write

$$ y(t) = G_c(p)\,u(t) $$   (2.22)


as a shorthand operator form of (2.1) or its underlying differential equation. Now,
(2.1) or (2.22) describes the output at all values of the continuous time variable $t$.
If $\{u(t)\}$ is a known function (piecewise constant or not), then (2.22) will of course
also serve as a description of the output at the sampling instants. We shall therefore
occasionally use (2.22) also as a system description for the sampled output values,
keeping in mind that the computation of these values will involve numerical solution
of a differential equation. In fact, we could still use a discrete-time model (2.9) for
the disturbances that influence our discrete-time measurements, writing this as

$$ y(t) = G_c(p)\,u(t) + H(q)\,e(t), \qquad t = 1, 2, \ldots $$   (2.23)


Often, however, we shall go from the continuous-time representation (2.22) to the
standard discrete-time one (2.15) by transforming the transfer function

$$ G_c(p) \rightarrow G_T(q) $$   (2.24)


$T$ here denotes the sampling interval. When the input is piecewise constant over
the sampling interval, this can be done without approximation, in view of (2.4).
See Problem 2G.4 for a direct transfer-function expression, and equations (4.67) to
(4.71) for numerically more favorable expressions. One can also apply approximate
formulas that correspond to replacing the differentiation operator $p$ by a difference
approximation. We thus have the Euler approximation

$$ G_T(q) \approx G_c\!\left(\frac{q-1}{T}\right) $$   (2.25)

and Tustin's formula

$$ G_T(q) \approx G_c\!\left(\frac{2}{T}\,\frac{q-1}{q+1}\right) $$   (2.26)

See Åström and Wittenmark (1984) for a further discussion.
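
To get a feeling for (2.25) and (2.26), the following Python sketch compares them with the exact sampled transfer function for one simple case. Everything specific here is an assumption made for the illustration: the first-order system G_c(s) = 1/(s + a), the values of a, T, and ω, the function names, and the choice to evaluate all expressions at q = exp(iωT). For this system with piecewise-constant input, (2.4) and (2.5) give the exact sampled transfer function in closed form.

```python
import cmath
import math


def G_c(s, a):
    """Assumed continuous-time transfer function 1 / (s + a)."""
    return 1.0 / (s + a)


def G_exact(q, a, T):
    """Exact sampled transfer function for piecewise-constant input, from (2.4)-(2.5)."""
    phi = math.exp(-a * T)
    return ((1.0 - phi) / a) / (q - phi)


def G_euler(q, a, T):
    """Euler approximation (2.25): replace p by (q - 1) / T."""
    return G_c((q - 1.0) / T, a)


def G_tustin(q, a, T):
    """Tustin's formula (2.26): replace p by (2 / T) * (q - 1) / (q + 1)."""
    return G_c((2.0 / T) * (q - 1.0) / (q + 1.0), a)


a, T, omega = 1.0, 0.1, 1.0
q = cmath.exp(1j * omega * T)

gain_exact = abs(G_exact(q, a, T))
gain_euler = abs(G_euler(q, a, T))
gain_tustin = abs(G_tustin(q, a, T))

# At q = 1 (zero frequency) all three expressions give the static gain 1 / a.
static_gains = [G_exact(1.0, a, T), G_euler(1.0, a, T), G_tustin(1.0, a, T)]
```

In this particular case all three expressions agree on the static gain, and the Tustin gain lies closer to the exact one than the Euler gain does; how the approximations behave for other systems and sampling intervals is not addressed by this sketch.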


(*) Denotes sections and subsections that are optional reading; they can be omitted without serious
loss of continuity. See Preface.
Some Terminology


The function $G(z)$ in (2.17) is a complex-valued function of the complex variable
$z$. Values $\beta_i$, such that $G(\beta_i) = 0$, are called zeros of the transfer function (or
of the system), while values $\alpha_i$ for which $G(z)$ tends to infinity are called poles.
This coincides with the terminology for analytic functions (see, e.g., Ahlfors, 1979).
If $G(z)$ is a rational function of $z$, the poles will be the zeros of the denominator
polynomial.

We shall say that the transfer function $G(q)$ (or "the system $G$" or "the filter
$G$") is stable if

$$ G(q) = \sum_{k=1}^{\infty} g(k)\,q^{-k}, \qquad \sum_{k=1}^{\infty} |g(k)| < \infty $$   (2.27)


The definition (2.27) coincides with the system theoretic definition of bounded-input,
bounded-output (BIBO) stability (e.g., Brockett, 1970): If an input $\{u(t)\}$ to $G(q)$
is subject to $|u(t)| \le C$, then the corresponding output $z(t) = G(q)u(t)$ will also
be bounded, $|z(t)| \le C'$, provided (2.27) holds. Notice also that (2.27) assures that
the (Laurent) expansion

$$ G(z) = \sum_{k=1}^{\infty} g(k)\,z^{-k} $$

is convergent for all $|z| \ge 1$. This means that the function $G(z)$ is analytic on and
outside the unit circle. In particular, it then has no poles in that area.
We shall often have occasion to consider families of filters $G_\alpha(q)$, $\alpha \in \mathcal{A}$:

$$ G_\alpha(q) = \sum_{k=1}^{\infty} g_\alpha(k)\,q^{-k}, \qquad \alpha \in \mathcal{A} $$   (2.28)


We shall then say that such a family is uniformly stable if

$$ |g_\alpha(k)| \le g(k), \quad \forall \alpha \in \mathcal{A}, \qquad \sum_{k=1}^{\infty} g(k) < \infty $$   (2.29)


Sometimes a slightly stronger condition than (2.27) will be required. We shall say
that $G(q)$ is strictly stable if

$$ \sum_{k=1}^{\infty} k\,|g(k)| < \infty $$   (2.30)

Notice that, for a transfer function that is rational in $q$, stability implies strict stability
(and, of course, vice versa). See Problem 2T.3.

Finally, we shall say that a filter $H(q)$ is monic if its zeroth coefficient is 1 (or
the unit matrix):

$$ H(q) = \sum_{k=0}^{\infty} h(k)\,q^{-k}, \qquad h(0) = 1 $$   (2.31)
2.2 FREQUENCY-DOMAIN EXPRESSIONS


Sinusoid Response and the Frequency Function


Suppose that the input to the system (2.6) is a sinusoid:

$$ u(t) = \cos \omega t $$   (2.32)

It will be convenient to rewrite this as

$$ u(t) = \mathrm{Re}\; e^{i\omega t} $$


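with Re denoting "real part." According to (2.6), the corresponding output will be

$$ \begin{aligned}
y(t) &= \sum_{k=1}^{\infty} g(k)\,\mathrm{Re}\; e^{i\omega(t-k)}
      = \mathrm{Re} \sum_{k=1}^{\infty} g(k)\,e^{i\omega(t-k)} \\
     &= \mathrm{Re}\left\{ e^{i\omega t} \sum_{k=1}^{\infty} g(k)\,e^{-i\omega k} \right\}
      = \mathrm{Re}\left\{ e^{i\omega t}\,G(e^{i\omega}) \right\} \\
     &= |G(e^{i\omega})| \cos(\omega t + \varphi)
\end{aligned} $$   (2.33)

where

$$ \varphi = \arg G(e^{i\omega}) $$   (2.34)

Here, the second equality follows since the $g(k)$ are real and the fourth equality
from the definition (2.16) or (2.17). The fifth equality follows straightforward rules
for complex numbers.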


In (2.33) we assumed that the input was a cosine since time minus infinity. If
$u(t) = 0$, $t < 0$, we obtain an additional term

$$ -\mathrm{Re}\left\{ e^{i\omega t} \sum_{k=t}^{\infty} g(k)\,e^{-i\omega k} \right\} $$

in (2.33). This term is dominated by

$$ \sum_{k=t}^{\infty} |g(k)| $$

and therefore is of transient nature (tends to zero as $t$ tends to infinity), provided
that $G(q)$ is stable.
In any case, (2.33) tells us that the output to (2.32) will also be a cosine of the
same frequency, but with an amplitude magnified by $|G(e^{i\omega})|$ and a phase shift of
$\arg G(e^{i\omega})$ radians. The complex number

$$ G(e^{i\omega}) $$   (2.35)

which is the transfer function evaluated at the point $z = e^{i\omega}$, therefore gives full
information as to what will happen in stationarity, when the input is a sinusoid of
frequency $\omega$. For that reason, the complex-valued function

$$ G(e^{i\omega}), \qquad -\pi \le \omega \le \pi $$   (2.36)
is called the frequency function of the system (2.6). It is customary to graphically
display this function as $\log |G(e^{i\omega})|$ and $\arg G(e^{i\omega})$ plotted against $\log \omega$ in a Bode
plot. The plot of (2.36) in the complex plane is called the Nyquist plot. These
concepts are probably better known in the continuous-time case, but all their basic
properties carry over to the sampled-data case.
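
The stationary sinusoid response (2.33) is easy to verify numerically. The Python sketch below assumes a hypothetical impulse response with only three nonzero coefficients, so that the infinite sum in (2.6) becomes finite and the cosine input "since time minus infinity" only needs to be evaluated three steps back; the coefficient values, the frequency, and the function name are choices made for this example.

```python
import cmath
import math


def frequency_function(g, omega):
    """G(e^{i omega}) = sum_{k>=1} g(k) e^{-i omega k}; g[0] holds g(1)."""
    return sum(gk * cmath.exp(-1j * omega * k)
               for k, gk in enumerate(g, start=1))


g = [0.5, 0.3, 0.2]  # assumed g(1), g(2), g(3); g(k) = 0 for k > 3
omega = 0.4

G = frequency_function(g, omega)
gain, phase = abs(G), cmath.phase(G)  # |G(e^{i omega})| and arg G(e^{i omega})

times = range(50)
# Direct evaluation of (2.6) with u(t) = cos(omega * t) for all t
y_direct = [sum(gk * math.cos(omega * (t - k))
                for k, gk in enumerate(g, start=1)) for t in times]
# Closed-form expression from (2.33)
y_formula = [gain * math.cos(omega * t + phase) for t in times]

max_err = max(abs(p - q) for p, q in zip(y_direct, y_formula))
```

The two outputs coincide to rounding error: the filter changes only the amplitude and the phase of the cosine, not its frequency.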


Periodograms of Signals over Finite Intervals 


Consider the finite sequence of inputs $u(t)$, $t = 1, 2, \ldots, N$. Let us define the
function $U_N(\omega)$ by

$$ U_N(\omega) = \frac{1}{\sqrt{N}} \sum_{t=1}^{N} u(t)\,e^{-i\omega t} $$   (2.37)

The values obtained for $\omega = 2\pi k/N$, $k = 1, \ldots, N$, form the familiar discrete
Fourier transform (DFT) of the sequence $u^N$. We can then represent $u(t)$ by the
inverse DFT as

$$ u(t) = \frac{1}{\sqrt{N}} \sum_{k=1}^{N} U_N(2\pi k/N)\,e^{i2\pi kt/N} $$   (2.38)


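To prove this, we insert (2.37) into the right side of (2.38), giving

$$ \begin{aligned}
\frac{1}{N} \sum_{k=1}^{N} \sum_{s=1}^{N} u(s) \exp\left(-\frac{i2\pi ks}{N}\right) \exp\left(\frac{i2\pi kt}{N}\right)
&= \frac{1}{N} \sum_{s=1}^{N} u(s) \sum_{k=1}^{N} \exp\left(\frac{i2\pi k(t-s)}{N}\right) \\
&= \frac{1}{N} \sum_{s=1}^{N} u(s)\,N\,\delta(t-s) = u(t)
\end{aligned} $$

Here we used the relationship

$$ \frac{1}{N} \sum_{k=1}^{N} e^{2\pi i rk/N} = \begin{cases} 1, & r = 0 \\ 0, & 1 \le r < N \end{cases} $$   (2.39)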
From (2.37) we note that $U_N(\omega)$ is periodic with period $2\pi$:

$$ U_N(\omega + 2\pi) = U_N(\omega) $$   (2.40)

Also, since $u(t)$ is real,

$$ U_N(-\omega) = \overline{U_N(\omega)} $$   (2.41)
where the overbar denotes the complex conjugate. The function $U_N(\omega)$ is therefore
uniquely defined by its values over the interval $[0, \pi]$. It is, however, customary
to consider $U_N(\omega)$ for $-\pi \le \omega \le \pi$, and in accordance with this, (2.38) is often
written

$$ u(t) = \frac{1}{\sqrt{N}} \sum_{k=-N/2+1}^{N/2} U_N(2\pi k/N)\,e^{i2\pi kt/N} $$   (2.42)

making use of (2.40) and the periodicity of $e^{i\omega t}$. In (2.42) and elsewhere we assume
$N$ to be even; for odd $N$ analogous summation boundaries apply.

In (2.42) we represent the signal $u(t)$ as a linear combination of $e^{i\omega t}$ for $N$
different frequencies $\omega$. As is further elaborated in Problem 2D.1, this can also
be rewritten as sums of $\cos \omega t$ and $\sin \omega t$ for the same frequencies, thus avoiding
complex numbers.

The number $U_N(2\pi k/N)$ tells us the "weight" that the frequency $\omega = 2\pi k/N$
carries in the decomposition of $\{u(t)\}$. Its absolute square value $|U_N(2\pi k/N)|^2$
is therefore a measure of the contribution of this frequency to the "signal power."
This value

$$ |U_N(\omega)|^2 $$   (2.43)

is known as the periodogram of the signal $u(t)$, $t = 1, 2, \ldots, N$.
Parseval's relationship,

$$ \sum_{k=1}^{N} |U_N(2\pi k/N)|^2 = \sum_{t=1}^{N} u^2(t) $$   (2.44)

reinforces the interpretation that the energy of the signal can be decomposed into
energy contributions from different frequencies. Think of the analog decomposition
of light into its spectral components!
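
The definitions (2.37), (2.38), and (2.43) and Parseval's relationship (2.44) can be checked directly on a short record. The Python sketch below evaluates the sums literally rather than with a fast Fourier transform; the data values and the function names are hypothetical and serve only as an illustration.

```python
import cmath
import math


def U_N(u, omega):
    """U_N(omega) as in (2.37); u[0] holds u(1)."""
    n = len(u)
    return sum(ut * cmath.exp(-1j * omega * t)
               for t, ut in enumerate(u, start=1)) / math.sqrt(n)


def periodogram(u):
    """|U_N(2 pi k / N)|^2 for k = 1, ..., N, cf. (2.43)."""
    n = len(u)
    return [abs(U_N(u, 2.0 * math.pi * k / n)) ** 2 for k in range(1, n + 1)]


def inverse_dft(u):
    """Reconstruct u(t), t = 1, ..., N, from its DFT by (2.38)."""
    n = len(u)
    coeffs = [U_N(u, 2.0 * math.pi * k / n) for k in range(1, n + 1)]
    return [sum(c * cmath.exp(2j * math.pi * k * t / n)
                for k, c in enumerate(coeffs, start=1)).real / math.sqrt(n)
            for t in range(1, n + 1)]


u = [1.0, -2.0, 0.5, 3.0, -1.5, 0.0, 2.0, 1.0]  # assumed data record, N = 8

energy_freq = sum(periodogram(u))     # left side of (2.44)
energy_time = sum(x * x for x in u)   # right side of (2.44)
u_rec = inverse_dft(u)
```

The two energies agree, and the inverse transform returns the original record, both to rounding error.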


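Example 2.1 Periodogram of a Sinusoid

Suppose that

$$ u(t) = A \cos \omega_0 t $$   (2.45)

where $\omega_0 = 2\pi/N_0$ for some integer $N_0 > 1$. Consider the interval $t = 1, 2, \ldots, N$,
where $N$ is a multiple of $N_0$: $N = s \cdot N_0$. Writing

$$ \cos \omega_0 t = \frac{1}{2}\left[ e^{i\omega_0 t} + e^{-i\omega_0 t} \right] $$

gives

$$ U_N(\omega) = \frac{1}{\sqrt{N}} \sum_{t=1}^{N} \frac{A}{2} \left[ e^{i(\omega_0 - \omega)t} + e^{-i(\omega_0 + \omega)t} \right] $$

Using (2.39), we find that

$$ |U_N(\omega)|^2 = \begin{cases} N \cdot \dfrac{A^2}{4}, & \text{if } \omega = \pm\omega_0 = \pm\dfrac{2\pi}{N_0} = \pm\dfrac{2\pi s}{N} \\[2mm] 0, & \text{if } \omega = \dfrac{2\pi k}{N},\ k \ne \pm s \end{cases} $$   (2.46)

The periodogram thus has two spikes in the interval $[-\pi, \pi]$. □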
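
The result (2.46) can be reproduced numerically. In the Python sketch below the amplitude A, the period N0, and the number of periods s are assumed values, and the function name is hypothetical. The frequencies are indexed by k = 1, ..., N as in (2.37), so by the periodicity (2.40) the spike at ω = -ω0 appears at index k = N - s.

```python
import cmath
import math


def U_N(u, omega):
    """U_N(omega) as in (2.37); u[0] holds u(1)."""
    n = len(u)
    return sum(ut * cmath.exp(-1j * omega * t)
               for t, ut in enumerate(u, start=1)) / math.sqrt(n)


A, N0, s = 2.0, 8, 4          # assumed amplitude, period, and number of periods
N = s * N0
omega0 = 2.0 * math.pi / N0

u = [A * math.cos(omega0 * t) for t in range(1, N + 1)]
P = [abs(U_N(u, 2.0 * math.pi * k / N)) ** 2 for k in range(1, N + 1)]

spike = N * A ** 2 / 4.0      # value predicted by (2.46)
spike_indices = {s, N - s}    # k = s is omega0; k = N - s is equivalent to -omega0
off_spike_max = max(p for k, p in enumerate(P, start=1)
                    if k not in spike_indices)
```

All the energy of the record sits at the two frequencies ±ω0, and the periodogram is zero (to rounding error) at the remaining grid points.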


Example 2.2 Periodogram of a Periodic Signal

Suppose $u(t) = u(t + N_0)$ and we consider the signal over the interval $[1, N]$,
$N = s \cdot N_0$. According to (2.42), the signal over the interval $[1, N_0]$ can be written

$$ u(t) = \frac{1}{\sqrt{N_0}} \sum_{r=-N_0/2+1}^{N_0/2} A_r\,e^{2\pi i tr/N_0} $$   (2.47)

with

$$ A_r = \frac{1}{\sqrt{N_0}} \sum_{t=1}^{N_0} u(t)\,e^{-2\pi i tr/N_0} $$   (2.48)

Since $u$ is periodic, (2.47) applies over the whole interval $[1, N]$. It is thus a sum of $N_0$
sinusoids, and the results of the previous example (or straightforward calculations)
show that

$$ |U_N(\omega)|^2 = \begin{cases} s \cdot |A_r|^2, & \text{if } \omega = \dfrac{2\pi r}{N_0},\ r = 0, \pm 1, \ldots, \pm\dfrac{N_0}{2} \\[2mm] 0, & \text{if } \omega = \dfrac{2\pi k}{N},\ k \ne r \cdot s \end{cases} $$   (2.49)

□


The periodograms of Examples 2.1 and 2.2 turned out to be well behaved. For signals 
that are realizations of stochastic processes, the periodogram is typically a very erratic 
function of frequency. See Figure 2.8 and Lemma 6.2. 


Transformation of Periodograms (*)


As a signal is filtered through a linear system, its periodogram changes. We show 
next how a signal's Fourier transform is affected by linear filtering. Results for the 
transformation of periodograms are then immediate. 


Theorem 2.1. Let $\{s(t)\}$ and $\{w(t)\}$ be related by the strictly stable system $G(q)$:

$$s(t) = G(q)w(t) \tag{2.50}$$

The input $w(t)$ for $t \le 0$ is unknown, but obeys $|w(t)| \le C_w$ for all $t$. Let

$$S_N(\omega) = \frac{1}{\sqrt{N}} \sum_{t=1}^{N} s(t) e^{-i\omega t} \tag{2.51}$$

$$W_N(\omega) = \frac{1}{\sqrt{N}} \sum_{t=1}^{N} w(t) e^{-i\omega t} \tag{2.52}$$

Then

$$S_N(\omega) = G(e^{i\omega}) W_N(\omega) + R_N(\omega) \tag{2.53}$$

where

$$|R_N(\omega)| \le 2 C_w \cdot \frac{C_G}{\sqrt{N}} \tag{2.54}$$

with

$$C_G = \sum_{k=1}^{\infty} k\,|g(k)| \tag{2.55}$$

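Proof. We have by definition

$$\begin{aligned}
S_N(\omega) &= \frac{1}{\sqrt{N}} \sum_{t=1}^{N} s(t) e^{-i\omega t} = \frac{1}{\sqrt{N}} \sum_{t=1}^{N} \sum_{k=1}^{\infty} g(k) w(t-k) e^{-i\omega t} \\
&= [\text{change variables: } t - k = \tau] \\
&= \sum_{k=1}^{\infty} g(k) e^{-i\omega k} \cdot \frac{1}{\sqrt{N}} \sum_{\tau=1-k}^{N-k} w(\tau) e^{-i\omega \tau}
\end{aligned}$$

Now

$$\begin{aligned}
\left| W_N(\omega) - \frac{1}{\sqrt{N}} \sum_{\tau=1-k}^{N-k} w(\tau) e^{-i\omega\tau} \right|
&\le \left| \frac{1}{\sqrt{N}} \sum_{\tau=1-k}^{0} w(\tau) e^{-i\omega\tau} \right| + \left| \frac{1}{\sqrt{N}} \sum_{\tau=N-k+1}^{N} w(\tau) e^{-i\omega\tau} \right| \\
&\le 2 \cdot \frac{k\,C_w}{\sqrt{N}}
\end{aligned} \tag{2.56}$$

Hence

$$\begin{aligned}
\left| S_N(\omega) - G(e^{i\omega}) W_N(\omega) \right|
&= \left| \sum_{k=1}^{\infty} g(k) e^{-i\omega k} \left[ \frac{1}{\sqrt{N}} \sum_{\tau=1-k}^{N-k} w(\tau) e^{-i\omega\tau} - W_N(\omega) \right] \right| \\
&\le \frac{2}{\sqrt{N}} \sum_{k=1}^{\infty} \left| k\,g(k)\,C_w\,e^{-i\omega k} \right| \le \frac{2\,C_w\,C_G}{\sqrt{N}}
\end{aligned}$$

and (2.53) to (2.55) follow. □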


Corollary. Suppose $\{w(t)\}$ is periodic with period $N$. Then $R_N(\omega)$ in (2.53) is
zero for $\omega = 2\pi k/N$.


Proof. The left side of (2.56) is zero for a periodic $w(t)$ at $\omega = 2\pi k/N$. □
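
The bound (2.54) and the corollary can be illustrated numerically. The sketch below assumes a short FIR filter $g(1), g(2), g(3)$ and a bounded random input, both chosen by us for the illustration; it evaluates the remainder $R_N(\omega)$ on a frequency grid and compares its largest magnitude with $2C_wC_G/\sqrt{N}$. For an input that is periodic with period $N$, the remainder is then evaluated at $\omega = 2\pi m/N$.

```python
import numpy as np

g = np.array([1.0, 0.5, 0.25])     # g(1), g(2), g(3): an illustrative FIR filter
K, N = len(g), 200
k = np.arange(1, K + 1)
t = np.arange(1, N + 1)
C_G = np.sum(k * np.abs(g))        # the constant (2.55)

def remainder(w, omega):
    """R_N(omega) = S_N(omega) - G(e^{i omega}) W_N(omega) for a callable w(t)."""
    s = sum(g[j - 1] * w(t - j) for j in k)          # s(t) = G(q) w(t), t = 1..N
    e = np.exp(-1j * omega * t) / np.sqrt(N)
    G = np.sum(g * np.exp(-1j * omega * k))
    return np.sum(s * e) - G * np.sum(w(t) * e)

rng = np.random.default_rng(0)
C_w = 1.0
past_and_present = rng.uniform(-C_w, C_w, size=N + K)   # w(1-K), ..., w(N)
w_bounded = lambda tt: past_and_present[tt + K - 1]

base = rng.uniform(-C_w, C_w, size=N)                   # one period of length N
w_periodic = lambda tt: base[(tt - 1) % N]

omegas = np.linspace(-np.pi, np.pi, 101)
worst = max(abs(remainder(w_bounded, om)) for om in omegas)
bound = 2 * C_w * C_G / np.sqrt(N)                      # right side of (2.54)
grid_worst = max(abs(remainder(w_periodic, 2 * np.pi * m / N))
                 for m in range(1, N + 1))
print(worst, bound, grid_worst)
```

The bound is typically quite conservative; its value lies in the $1/\sqrt{N}$ rate rather than in the constant.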


2.3 SIGNAL SPECTRA 


The periodogram defines, in a sense, the frequency contents of a signal over a finite
time interval. This information may, however, be fairly hidden due to the typically
erratic behavior of a periodogram as a function of $\omega$. We now seek a definition of a
similar concept for signals over the interval $t \in [1, \infty)$. Preferably, such a concept
should more clearly demonstrate the different frequency contributions to the signal.

A definition for our framework is, however, not immediate. It would perhaps
be natural to define the spectrum of a signal $s$ as

$$\lim_{N \to \infty} |S_N(\omega)|^2 \tag{2.57}$$

but this limit fails to exist for many signals of practical interest. Another possibility
would be to use the concept of the spectrum, or spectral density, of a stationary
stochastic process as the Fourier transform of its covariance function. However, the
processes that we consider here are frequently not stationary, for reasons that are
described later. We shall therefore develop a framework for describing signals and
their spectra that is applicable to deterministic as well as stochastic signals.


A Common Framework for Deterministic and Stochastic Signals 


In this book we shall frequently work with signals that are described as stochastic
processes with deterministic components. The reason is, basically, that we prefer to
consider the input sequence as deterministic, or at least partly deterministic, while
disturbances on the system most conveniently are described by random variables. In
this way the system output becomes a stochastic process with deterministic
components. For (2.20) we find that

$$Ey(t) = G(q)u(t)$$

so $\{y(t)\}$ is not a stationary process.

To deal with this problem, we introduce the following definition.


Definition 2.1. Quasi-Stationary Signals A signal $\{s(t)\}$ is said to be quasi-stationary
if it is subject to

$$\text{(i)} \quad Es(t) = m_s(t), \qquad |m_s(t)| \le C, \quad \forall t \tag{2.58}$$

$$\text{(ii)} \quad Es(t)s(r) = R_s(t, r), \qquad |R_s(t, r)| \le C$$

$$\lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} R_s(t, t - \tau) = R_s(\tau), \quad \forall \tau \tag{2.59}$$

Here expectation $E$ is with respect to the “stochastic components” of $s(t)$. If
$\{s(t)\}$ itself is a deterministic sequence, the expectation is without effect and
quasi-stationarity then means that $\{s(t)\}$ is a bounded sequence such that the limits

$$R_s(\tau) = \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} s(t)s(t - \tau)$$

exist. If $\{s(t)\}$ is a stationary stochastic process, (2.58) and (2.59) are trivially satisfied,
since then $Es(t)s(t - \tau) = R_s(\tau)$ does not depend on $t$.
For easy notation we introduce the symbol $\bar{E}$ by

$$\bar{E}f(t) = \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} Ef(t) \tag{2.60}$$

with an implied assumption that the limit exists when the symbol is used. Assumption
(2.59), which simultaneously is a definition of $R_s(\tau)$, then reads

$$\bar{E}s(t)s(t - \tau) = R_s(\tau) \tag{2.61}$$

Sometimes, with some abuse of notation, we shall call $R_s(\tau)$ the covariance function
of $s$, keeping in mind that this is a correct term only if $\{s(t)\}$ is a stationary stochastic
process with mean value zero.

Similarly, we say that two signals $\{s(t)\}$ and $\{w(t)\}$ are jointly quasi-stationary
if they both are quasi-stationary and if, in addition, the cross-covariance function

$$R_{sw}(\tau) = \bar{E}s(t)w(t - \tau) \tag{2.62}$$

exists. We shall say that jointly quasi-stationary signals are uncorrelated if their
cross-covariance function is identically zero.


Definition of Spectra


When limits like (2.61) and (2.62) hold, we define the (power) spectrum of $\{s(t)\}$ as

$$\Phi_s(\omega) = \sum_{\tau=-\infty}^{\infty} R_s(\tau) e^{-i\tau\omega} \tag{2.63}$$

and the cross spectrum between $\{s(t)\}$ and $\{w(t)\}$ as

$$\Phi_{sw}(\omega) = \sum_{\tau=-\infty}^{\infty} R_{sw}(\tau) e^{-i\tau\omega} \tag{2.64}$$

provided the infinite sums exist. In the sequel, as we talk of a signal's “spectrum,”
we always implicitly assume that the signal has all the properties involved in the
definition of spectrum.

While $\Phi_s(\omega)$ always is real, $\Phi_{sw}(\omega)$ is in general a complex-valued function of
$\omega$. Its real part is known as the cospectrum and its imaginary part as the quadrature
spectrum. The argument $\arg \Phi_{sw}(\omega)$ is called the phase spectrum, while $|\Phi_{sw}(\omega)|$
is the amplitude spectrum.

Note that, by definition of the inverse Fourier transform, we have

$$\bar{E}s^2(t) = R_s(0) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \Phi_s(\omega)\,d\omega \tag{2.65}$$


Example 2.3 Periodic Signals 


Consider a deterministic, periodic signal with period $M$, i.e., $s(t) = s(t + M)$. We
then have $Es(t) = s(t)$ and $Es(t)s(r) = s(t)s(r)$ and:

$$\frac{1}{N} \sum_{t=1}^{N} s(t)s(t - \tau) = \frac{1}{M} \cdot \frac{MK}{N} \cdot \frac{1}{K} \sum_{\ell=0}^{K-1} \left\{ \sum_{t=1}^{M} s(t + \ell M)s(t - \tau + \ell M) \right\} + \frac{1}{N} \sum_{t=KM+1}^{N} s(t)s(t - \tau)$$

where $K$ is chosen as the maximum number of full periods, i.e., $N - MK < M$.
Due to the periodicity, the sum within braces in fact does not depend on $\ell$. Since
there are at most $M - 1$ terms in the last sum, this means that the limit as $N \to \infty$
will exist with

$$\bar{E}s(t)s(t - \tau) = R_s(\tau) = \frac{1}{M} \sum_{t=1}^{M} s(t)s(t - \tau)$$

A periodic, deterministic signal is thus quasi-stationary. We clearly also have
$R_s(\tau + kM) = R_s(\tau)$.

For the spectrum we then have:

$$\begin{aligned}
\Phi_s(\omega) &= \sum_{\tau=-\infty}^{\infty} R_s(\tau) e^{-i\omega\tau} = \sum_{\ell=-\infty}^{\infty} \sum_{\tau=0}^{M-1} R_s(\tau + \ell M) e^{-i\omega\tau} e^{-i\omega\ell M} \\
&= \Phi_s^p(\omega) \sum_{\ell=-\infty}^{\infty} e^{-i\omega\ell M} = \Phi_s^p(\omega) F(\omega, M)
\end{aligned}$$

where

$$\Phi_s^p(\omega) = \sum_{\tau=0}^{M-1} R_s(\tau) e^{-i\omega\tau}, \qquad F(\omega, M) = \sum_{\ell=-\infty}^{\infty} e^{-i\omega\ell M}$$

The function $F$ is not well defined in the usual sense, but using the Dirac delta
function, it is well known (and well in line with (2.39)) that:

$$F(\omega, M) = \frac{2\pi}{M} \sum_{k=0}^{M-1} \delta(\omega - 2\pi k/M), \qquad 0 \le \omega < 2\pi \tag{2.66}$$

This means that we can write:

$$\Phi_s(\omega) = \frac{2\pi}{M} \sum_{k=0}^{M-1} \Phi_s^p(2\pi k/M)\,\delta(\omega - 2\pi k/M), \qquad 0 \le \omega < 2\pi \tag{2.67}$$

In $\Phi_s^p(2\pi k/M)$ we recognize the $k$:th Fourier coefficient of the periodic signal $R_s(\tau)$.
Recall that the spectrum is periodic with period $2\pi$, so different representations of
(2.67) can be given.

The spectrum of a signal that is periodic with period $M$ has thus (at most) $M$
delta spikes, at the frequencies $2\pi k/M$, and is zero elsewhere. Although the limit
(2.63) does not exist in a formal sense in this case, it is useful to extend the spectrum
concept for signals with periodic and constant components in this way. □


Example 2.4 Spectrum of a Sinusoid 
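
Consider again the signal (2.45), now extended to the interval $[1, \infty)$. We have

$$\frac{1}{N} \sum_{k=1}^{N} Eu(k)u(k - \tau) = \frac{1}{N} \sum_{k=1}^{N} A^2 \cos(\omega_0 k) \cos(\omega_0(k - \tau)) \tag{2.68}$$

(Expectation is of no consequence since $u$ is deterministic.) Now

$$\cos(\omega_0 k) \cos(\omega_0(k - \tau)) = \frac{1}{2}\left(\cos(2\omega_0 k - \omega_0\tau) + \cos \omega_0\tau\right)$$

which shows that

$$\bar{E}u(t)u(t - \tau) = \frac{A^2}{2} \cos \omega_0\tau = R_u(\tau)$$

The spectrum now is

$$\Phi_u(\omega) = \sum_{\tau=-\infty}^{\infty} \frac{A^2}{2} \cos(\omega_0\tau)\,e^{-i\omega\tau} = \frac{A^2}{4}\left(\delta(\omega - \omega_0) + \delta(\omega + \omega_0)\right) \cdot 2\pi \tag{2.69}$$

This result fits well with the finite interval expression (2.46) and the general expression
(2.67) for periodic signals. □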
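
The time average in (2.68) can be evaluated directly for a long record to see how quickly it settles at $\frac{A^2}{2}\cos\omega_0\tau$. The following sketch uses illustrative values of $A$, $\omega_0$, and $N$ that are our own choices; the term $\cos(2\omega_0 k - \omega_0\tau)$ averages out at a rate proportional to $1/N$.

```python
import numpy as np

A, omega0 = 1.5, 0.7          # illustrative amplitude and frequency (rad/sample)
N = 100_000
k = np.arange(1, N + 1)

def time_average_cov(tau):
    """(1/N) * sum_{k=1}^{N} u(k) u(k - tau) for u(t) = A cos(omega0 t)."""
    return np.mean(A * np.cos(omega0 * k) * A * np.cos(omega0 * (k - tau)))

taus = np.arange(0, 6)
estimate = np.array([time_average_cov(tau) for tau in taus])
limit = A ** 2 / 2 * np.cos(omega0 * taus)     # R_u(tau) from Example 2.4
print(np.max(np.abs(estimate - limit)))
```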
Example 2.5 Stationary Stochastic Processes


Let $\{v(t)\}$ be a stationary stochastic process with covariance function (2.14). Since
(2.59) then equals (2.14), our definition of spectrum coincides with the conventional
one. Suppose now that the process $v$ is given as (2.9). Its covariance function is then
given by (2.13). The spectrum is

$$\begin{aligned}
\Phi_v(\omega) &= \sum_{\tau=-\infty}^{\infty} \lambda e^{-i\tau\omega} \sum_{k=\max(0,\tau)}^{\infty} h(k)h(k - \tau) \\
&= \lambda \sum_{\tau=-\infty}^{\infty} \sum_{k=\max(0,\tau)}^{\infty} h(k)e^{-ik\omega}\,h(k - \tau)e^{i(k-\tau)\omega} \\
&= [k - \tau = s] = \lambda \sum_{s=0}^{\infty} h(s)e^{is\omega} \sum_{k=0}^{\infty} h(k)e^{-ik\omega} = \lambda \left|H(e^{i\omega})\right|^2
\end{aligned}$$

using (2.18). This result is very important for our future use:


The stochastic process described by $v(t) = H(q)e(t)$, where $\{e(t)\}$ is a
sequence of independent random variables with zero mean values and
covariances $\lambda$, has the spectrum

$$\Phi_v(\omega) = \lambda \left|H(e^{i\omega})\right|^2 \tag{2.70}$$


This result, which was easy to prove for the special case of a stationary stochastic
process, will be proved in the general case as Theorem 2.2 later in this section. Figure
2.7 shows the spectrum of the process of Figures 2.5 and 2.6, while the periodogram
of the realization of Figure 2.6 is shown in Figure 2.8. □


Figure 2.7 The spectrum $\Phi_v(\omega)$ of the process $v(t) - 1.5v(t - 1) + 0.7v(t - 2) =
e(t) + 0.5e(t - 1)$, $\{e(t)\}$ being white noise.

Figure 2.8 The periodogram $|V_N(\omega)|^2$ of the realization of Figure 2.6.
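
To make (2.70) concrete, the sketch below evaluates $\lambda|H(e^{i\omega})|^2$ for the process of Figure 2.7 in two ways: directly from the polynomials of $H = C/A$, and as the Fourier sum (2.63) of the covariance function $\lambda\sum_k h(k)h(k-\tau)$ built from a truncated impulse response. The noise variance $\lambda = 1$ and the truncation length are assumptions made for the illustration, and the function names are ours.

```python
import numpy as np

c = np.array([1.0, 0.5])           # C(q) = 1 + 0.5 q^-1
a = np.array([1.0, -1.5, 0.7])     # A(q) = 1 - 1.5 q^-1 + 0.7 q^-2
lam = 1.0                          # noise variance (assumed; not given in the caption)

def spectrum(omega):
    """lam * |H(e^{i omega})|^2 with H = C/A, evaluated from the polynomials."""
    z_inv = np.exp(-1j * omega)
    C = np.polyval(c[::-1], z_inv)   # polyval expects the highest power first
    A = np.polyval(a[::-1], z_inv)
    return lam * np.abs(C / A) ** 2

# Impulse response h(k), k = 0..M-1, from the difference equation driven by a unit pulse.
M = 400
h = np.zeros(M)
for k in range(M):
    x = c[k] if k < len(c) else 0.0
    x -= sum(a[j] * h[k - j] for j in range(1, len(a)) if k - j >= 0)
    h[k] = x

def spectrum_from_covariance(omega):
    """Sum over tau of R_v(tau) e^{-i tau omega}, R_v(tau) = lam * sum_k h(k) h(k - tau)."""
    R = lam * np.correlate(h, h, mode="full")     # lags -(M-1), ..., M-1
    tau = np.arange(-(M - 1), M)
    return np.real(np.sum(R * np.exp(-1j * tau * omega)))

for om in (0.1, 0.5, 1.0, 3.0):
    print(om, spectrum(om), spectrum_from_covariance(om))
```

The poles of this $H$ have modulus $\sqrt{0.7}$, so the impulse response decays quickly and a truncation at a few hundred terms is more than enough here.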


Example 2.6 Spectrum of a Mixed Deterministic and Stochastic Signal 
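Consider now a signal

$$s(t) = u(t) + v(t) \tag{2.71}$$

where $\{u(t)\}$ is a deterministic signal with spectrum $\Phi_u(\omega)$ and $\{v(t)\}$ is a stationary
stochastic process with zero mean value and spectrum $\Phi_v(\omega)$. Then

$$\begin{aligned}
\bar{E}s(t)s(t - \tau) &= \bar{E}u(t)u(t - \tau) + \bar{E}u(t)v(t - \tau) + \bar{E}v(t)u(t - \tau) + \bar{E}v(t)v(t - \tau) \\
&= R_u(\tau) + R_v(\tau)
\end{aligned} \tag{2.72}$$

since $\bar{E}v(t)u(t - \tau) = 0$. Hence

$$\Phi_s(\omega) = \Phi_u(\omega) + \Phi_v(\omega) \tag{2.73}$$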


Connections to the Periodogram 


While the original idea (2.57) does not hold, a conceptually related result can be
proved; that is, the expected value of the periodogram converges weakly to the
spectrum:

$$E|S_N(\omega)|^2 \rightharpoonup \Phi_s(\omega) \tag{2.74}$$

By this is meant that

$$\lim_{N \to \infty} \int_{-\pi}^{\pi} E|S_N(\omega)|^2\,\Psi(\omega)\,d\omega = \int_{-\pi}^{\pi} \Phi_s(\omega)\Psi(\omega)\,d\omega \tag{2.75}$$

for all sufficiently smooth functions $\Psi(\omega)$. We have


Lemma 2.1. Suppose that $\{s(t)\}$ is quasi-stationary with spectrum $\Phi_s(\omega)$. Let

$$S_N(\omega) = \frac{1}{\sqrt{N}} \sum_{t=1}^{N} s(t)e^{-it\omega}$$

and let $\Psi(\omega)$ be an arbitrary function for $|\omega| \le \pi$ with Fourier coefficients $a_\tau$, such
that

$$\sum_{\tau=-\infty}^{\infty} |a_\tau| < \infty$$

Then (2.75) holds.
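Proof.

$$E|S_N(\omega)|^2 = \frac{1}{N} \sum_{\ell=1}^{N} \sum_{k=1}^{N} Es(\ell)s(k)\,e^{i\omega(k - \ell)} = [\ell - k = \tau] = \sum_{\tau=-(N-1)}^{N-1} R_N(\tau)e^{-i\omega\tau} \tag{2.76}$$

where

$$R_N(\tau) = \frac{1}{N} \sum_{k=1}^{N} Es(k)s(k - \tau) \tag{2.77}$$

with the convention that $s(k)$ is taken as zero outside the interval $[1, N]$. Multiplying
(2.76) by $\Psi(\omega)$ and integrating over $[-\pi, \pi]$ gives

$$\int_{-\pi}^{\pi} E|S_N(\omega)|^2\,\Psi(\omega)\,d\omega = \sum_{\tau=-(N-1)}^{N-1} R_N(\tau)\,a_\tau$$

by the definition of $a_\tau$. Similarly, allowing interchange of summation and integration,
we have

$$\int_{-\pi}^{\pi} \Phi_s(\omega)\Psi(\omega)\,d\omega = \sum_{\tau=-\infty}^{\infty} R_s(\tau)\,a_\tau$$

Hence

$$\int_{-\pi}^{\pi} E|S_N(\omega)|^2\,\Psi(\omega)\,d\omega - \int_{-\pi}^{\pi} \Phi_s(\omega)\Psi(\omega)\,d\omega = \sum_{\tau=-(N-1)}^{N-1} a_\tau\left[R_N(\tau) - R_s(\tau)\right] - \sum_{|\tau| \ge N} a_\tau R_s(\tau)$$

Both sums on the right tend to zero as $N \to \infty$ by the argument of Problem 2D.5,
which completes the proof. □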
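
For a given deterministic record the expectation in (2.76) is without effect, and the identity between the periodogram and the Fourier sum of $R_N(\tau)$ can be checked directly. The sketch below does this for a short, arbitrary sequence (our choice), using the convention that $s(k)$ is zero outside $[1, N]$.

```python
import numpy as np

s = np.array([0.5, -1.0, 2.0, 1.5, -0.5, 0.0, 1.0])   # any finite record s(1..N)
N = len(s)
t = np.arange(1, N + 1)

def periodogram(omega):
    """|S_N(omega)|^2 with S_N as in Lemma 2.1."""
    return abs(np.sum(s * np.exp(-1j * omega * t))) ** 2 / N

def R_N(tau):
    """(1/N) sum_k s(k) s(k - tau), with s taken as zero outside [1, N]."""
    tau = abs(tau)                       # the zero-padded sum is even in tau
    return np.dot(s[tau:], s[:N - tau]) / N

def covariance_side(omega):
    """Right side of (2.76): sum of R_N(tau) e^{-i omega tau} over |tau| <= N-1."""
    taus = np.arange(-(N - 1), N)
    return np.real(sum(R_N(tau) * np.exp(-1j * omega * tau) for tau in taus))

for om in (0.0, 0.4, 1.3, 3.0):
    print(om, periodogram(om), covariance_side(om))
```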


Notice that for stationary stochastic processes the result (2.74) can be strengthened
to “ordinary” convergence (see Problem 2D.3). Notice also that, in our framework,
results like (2.74) can be applied also to realizations of stochastic processes
simply by ignoring the expectation operator. We then view the realization in question
as a given “deterministic” sequence, and will then, of course, have to require
that the conditions (2.58) and (2.59) hold for this particular realization [disregard
“E” also in (2.58) and (2.59)].


Transformation of Spectra by Linear Systems 


As signals are filtered through linear systems, their properties will change. We saw
how the periodogram was transformed in Theorem 2.1 and how white noise created
stationary stochastic processes in (2.70). For spectra we have the following general
result.


Theorem 2.2. Let $\{w(t)\}$ be a quasi-stationary signal with spectrum $\Phi_w(\omega)$, and
let $G(q)$ be a stable transfer function. Let

$$s(t) = G(q)w(t) \tag{2.78}$$

Then $\{s(t)\}$ is also quasi-stationary and

$$\Phi_s(\omega) = \left|G(e^{i\omega})\right|^2 \Phi_w(\omega) \tag{2.79}$$

$$\Phi_{sw}(\omega) = G(e^{i\omega})\,\Phi_w(\omega) \tag{2.80}$$

Proof. The proof is given in Appendix 2A. □


Corollary. Let $\{y(t)\}$ be given by

$$y(t) = G(q)u(t) + H(q)e(t) \tag{2.81}$$

where $\{u(t)\}$ is a quasi-stationary, deterministic signal with spectrum $\Phi_u(\omega)$, and
$\{e(t)\}$ is white noise with variance $\lambda$. Let $G$ and $H$ be stable filters. Then $\{y(t)\}$ is
quasi-stationary and

$$\Phi_y(\omega) = \left|G(e^{i\omega})\right|^2 \Phi_u(\omega) + \lambda \left|H(e^{i\omega})\right|^2 \tag{2.82}$$

$$\Phi_{yu}(\omega) = G(e^{i\omega})\,\Phi_u(\omega) \tag{2.83}$$

Proof. The corollary follows from the theorem using Examples 2.5 and 2.6. □
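
The relations (2.79) and (2.80) can be verified on a small case where all sums are finite. The sketch below assumes, purely for illustration, a two-tap FIR filter $g(1), g(2)$ and an input with the covariance function of $w(t) = e(t) + 0.5e(t - 1)$ with unit-variance $e$. It forms $R_{sw}(\tau) = \sum_k g(k)R_w(\tau - k)$ and $R_s(\tau) = \sum_k\sum_\ell g(k)g(\ell)R_w(\tau - k + \ell)$ from the definitions and compares their Fourier sums with the right sides of (2.79) and (2.80).

```python
import numpy as np

g = {1: 0.8, 2: -0.3}                     # g(k): illustrative FIR impulse response
R_w = {0: 1.25, 1: 0.5, -1: 0.5}          # covariance of w(t) = e(t) + 0.5 e(t-1)

def Rw(tau):
    return R_w.get(tau, 0.0)

taus = range(-4, 5)                        # wide enough to hold every nonzero lag
# R_sw(tau) = E s(t) w(t - tau) = sum_k g(k) R_w(tau - k)
R_sw = {tau: sum(gk * Rw(tau - k) for k, gk in g.items()) for tau in taus}
# R_s(tau) = E s(t) s(t - tau) = sum_k sum_l g(k) g(l) R_w(tau - k + l)
R_s = {tau: sum(gk * gl * Rw(tau - k + l)
                for k, gk in g.items() for l, gl in g.items())
       for tau in taus}

def fourier_sum(R, omega):
    """Spectrum as in (2.63)/(2.64): sum over tau of R(tau) e^{-i tau omega}."""
    return sum(r * np.exp(-1j * tau * omega) for tau, r in R.items())

def G(omega):
    """Frequency function G(e^{i omega}) = sum_k g(k) e^{-i omega k}."""
    return sum(gk * np.exp(-1j * omega * k) for k, gk in g.items())

om = 1.0
Phi_w = fourier_sum(R_w, om)
print(fourier_sum(R_s, om), abs(G(om)) ** 2 * Phi_w)     # (2.79)
print(fourier_sum(R_sw, om), G(om) * Phi_w)              # (2.80)
```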


Spectral Factorization 


Typically, the transfer functions $G(q)$ and $H(q)$ used here are rational functions of
$q$. Then results like (2.70) and Theorem 2.2 describe spectra as real-valued rational
functions of $e^{i\omega}$ (which means that they also are rational functions of $\cos \omega$).

In practice, the converse of such results is of major interest: Given a spectrum
$\Phi_v(\omega)$, can we then find a transfer function $H(q)$ such that the process $v(t) =
H(q)e(t)$ has this spectrum with $\{e(t)\}$ being white noise? It is quite clear that this
is not possible for all positive functions $\Phi_v(\omega)$. For example, if the spectrum is zero
on an interval, then the function $H(z)$ must be zero on a portion of the unit circle.
But since by necessity $H(z)$ should be analytic outside and on the unit circle for
the expansion (2.18) to make sense, this implies that $H(z)$ is zero everywhere and
cannot match the chosen spectrum.

The exact conditions under which our question has a positive answer are
discussed in texts on stationary processes, such as Wiener (1949) and Rozanov (1967).
For our purposes it is sufficient to quote a simpler result, dealing only with spectral
densities $\Phi_v(\omega)$ that are rational in the variable $e^{i\omega}$ (or $\cos \omega$).


Spectral factorization: Suppose that $\Phi_v(\omega) > 0$ is a rational function of $\cos \omega$
(or $e^{i\omega}$). Then there exists a monic rational function of $z$, $R(z)$, with no poles and
no zeros on or outside the unit circle such that

$$\Phi_v(\omega) = \lambda \left|R(e^{i\omega})\right|^2$$

The proof of this result consists of a straightforward construction of $R$, and it can be
found in standard texts on stochastic processes or stochastic control (e.g., Rozanov,
1967; Åström, 1970).


Example 2.7 ARMA Processes 


If a stationary process $\{v(t)\}$ has rational spectrum $\Phi_v(\omega)$, we can represent it as

$$v(t) = R(q)e(t) \tag{2.84}$$

where $\{e(t)\}$ is white noise with variance $\lambda$. Here $R(q)$ is a rational function

$$R(q) = \frac{C(q)}{A(q)}$$

$$C(q) = 1 + c_1 q^{-1} + \cdots + c_{n_c} q^{-n_c}$$

$$A(q) = 1 + a_1 q^{-1} + \cdots + a_{n_a} q^{-n_a}$$

so that we may write

$$v(t) + a_1 v(t - 1) + \cdots + a_{n_a} v(t - n_a) = e(t) + c_1 e(t - 1) + \cdots + c_{n_c} e(t - n_c) \tag{2.85}$$

for (2.84). Such a representation of a stochastic process is known as an ARMA model.
If $n_c = 0$, we have an autoregressive (AR) model:

$$v(t) + a_1 v(t - 1) + \cdots + a_{n_a} v(t - n_a) = e(t) \tag{2.86}$$

And if $n_a = 0$, we have a moving average (MA) model:

$$v(t) = e(t) + c_1 e(t - 1) + \cdots + c_{n_c} e(t - n_c) \tag{2.87}$$

□
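
For the simplest case the construction behind the spectral factorization result can be carried out by hand. A spectrum of the form $r_0 + 2r_1\cos\omega$ with $r_0 > 2|r_1|$ is matched by an MA(1) model $v(t) = e(t) + ce(t - 1)$ if $\lambda(1 + c^2) = r_0$ and $\lambda c = r_1$; of the two solutions for $c$, the one inside the unit circle gives a factor with no zeros on or outside the unit circle. The sketch below implements this special case only; the function name and the numerical values are hypothetical choices for the illustration, not a general factorization routine.

```python
import math

def factor_ma1(r0, r1):
    """Return (c, lam) with |c| < 1 such that
    r0 + 2*r1*cos(w) = lam * |1 + c*exp(-i*w)|^2 for all w."""
    if r0 <= 2 * abs(r1):
        raise ValueError("spectrum r0 + 2*r1*cos(w) is not strictly positive")
    if r1 == 0:
        return 0.0, r0
    # c solves r1*c^2 - r0*c + r1 = 0; the two roots are reciprocal,
    # and the one with the minus sign lies inside the unit circle.
    c = (r0 - math.sqrt(r0 ** 2 - 4 * r1 ** 2)) / (2 * r1)
    lam = r0 / (1 + c ** 2)
    return c, lam

c, lam = factor_ma1(2.18, 0.6)
print(c, lam)
```

Choosing the other root, $1/c$, gives the same spectrum with a different $\lambda$, which is why the requirement on the zeros is needed to make the factor unique.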


The spectral factorization concept is important since it provides a way of representing
the disturbance in the standard form $v = H(q)e$ from information about its spectrum
only. The spectrum is usually a sound engineering way of describing properties of
signals: “The disturbances are concentrated around 50 Hz” or “We are having
low-frequency disturbances with little power over 1 rad/s.” Rational functions are able
to approximate functions of rather versatile shapes. Hence the spectral factorization
result will provide a good modeling framework for disturbances.


Second-order Properties 


The signal spectra, as defined here, describe the second-order properties of the signals
(for stochastic processes, their second-order statistics, i.e., first and second moments).
Recall from Section 2.1 that stochastic processes may have very different looking
realizations even if they have the same covariance function (see Figures 2.5 and 2.6)!
The spectrum thus describes only certain aspects of a signal. Nevertheless, it will
turn out that many properties related to identification depend only on the spectra
of the involved signals. This motivates our detailed interest in the second-order
properties.


2.4 SINGLE REALIZATION BEHAVIOR AND ERGODICITY RESULTS (*)


All the results of the previous section are also valid, as we pointed out, for the special
case of a given deterministic signal $\{s(t)\}$. Definitions of spectra, their
transformations (Theorem 2.2) and their relationship with the periodogram (Lemma 2.1) hold
unchanged: we may just disregard the expectation $E$ and interpret $\bar{E}f(t)$ as

$$\lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} f(t)$$

There is a certain charm with results like these that do not rely on a probabilistic
framework: we anyway observe just one realization, so why should we embed this
observation in a stochastic process and describe its average properties taken over an
ensemble of potential observations? There are two answers to this question. One is
that such a framework facilitates certain calculations. Another is that it allows us to
deal with the question of what would happen if the experiment were repeated.

Nevertheless, it is a valid question to ask whether the spectrum of the signal
$\{s(t)\}$, as defined in a probabilistic framework, differs from the spectrum of the
actually observed, single realization were it to be considered as a given, deterministic
signal. This is the problem of ergodic theory, and for our setup we have the following
fairly general result.


Theorem 2.3. Let $\{s(t)\}$ be a quasi-stationary signal. Let $Es(t) = m(t)$. Assume
that

$$s(t) - m(t) = v(t) = \sum_{k=0}^{\infty} h_t(k)e(t - k) = H_t(q)e(t) \tag{2.88}$$

where $\{e(t)\}$ is a sequence of independent random variables with zero mean values,
$Ee^2(t) = \lambda_t$, and bounded fourth moments, and where $\{H_t(q),\ t = 1, 2, \ldots\}$ is a
uniformly stable family of filters. Then, with probability 1 as $N$ tends to infinity,

$$\frac{1}{N} \sum_{t=1}^{N} s(t)s(t - \tau) \to \bar{E}s(t)s(t - \tau) = R_s(\tau) \tag{2.89a}$$

$$\frac{1}{N} \sum_{t=1}^{N} \left[s(t)m(t - \tau) - Es(t)m(t - \tau)\right] \to 0 \tag{2.89b}$$

$$\frac{1}{N} \sum_{t=1}^{N} \left[s(t)v(t - \tau) - Es(t)v(t - \tau)\right] \to 0 \tag{2.89c}$$

The proof is given in Appendix 2B.


The theorem is quite important. It says that, provided the stochastic part of
the signal can be described as filtered white noise as in (2.88), then


the spectrum of an observed single realization of $\{s(t)\}$, computed as for a
deterministic signal, coincides, with probability 1, with that of the process
$\{s(t)\}$, defined by ensemble averages ($E$) as in (2.61).


This de-emphasizes the distinction between deterministic and stochastic signals
when we consider second-order properties only. A signal $\{s(t)\}$ whose spectrum is
$\Phi_s(\omega) \equiv \lambda$ may, for all purposes related to second-order properties, be regarded as
a realization of white noise with variance $\lambda$.

The theorem also gives an answer to the question of whether our “theoretical”
spectrum, defined in (2.63) using the physically unrealizable concepts of $E$ and lim,
relates to the actually observed periodogram (2.43). According to Theorem 2.3 and
Lemma 2.1, “smoothed” versions of $|S_N(\omega)|^2$ will look like $\Phi_s(\omega)$ for large $N$.
Compare Figures 2.7 and 2.8. This link between our theoretical concepts and the
real data is of course of fundamental importance. See Section 6.3.
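
A quick simulation illustrates (2.89a). The sketch below generates one realization of an MA(1) process (the parameter values, the Gaussian noise, and the random seed are assumptions made for the illustration) and compares the time averages over this single record with the ensemble covariances $R_v(0) = \lambda(1 + c^2)$, $R_v(1) = \lambda c$, and $R_v(\tau) = 0$ for $\tau \ge 2$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, c, lam = 200_000, 0.5, 1.0            # illustrative sample size and MA(1) parameters
e = rng.normal(0.0, np.sqrt(lam), size=N + 1)
v = e[1:] + c * e[:-1]                   # v(t) = e(t) + c e(t-1), t = 1..N

def sample_cov(x, tau):
    """(1/N) sum_t x(t) x(t - tau), with x taken as zero outside the record."""
    return np.dot(x[tau:], x[:len(x) - tau]) / len(x)

estimate = np.array([sample_cov(v, tau) for tau in range(4)])
ensemble = np.array([lam * (1 + c ** 2), lam * c, 0.0, 0.0])
print(estimate)
print(ensemble)
```

The agreement improves roughly as $1/\sqrt{N}$; a different seed gives a different realization but, by the theorem, the same limits.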
2.5 MULTIVARIABLE SYSTEMS (*)


So far, we have worked with systems having a scalar input and a scalar output. In
this section we shall consider the case where the output signal has $p$ components
and the input signal has $m$ components. Such systems are called multivariable. The
extra work involved in dealing with models of multivariable systems can be split up
into two parts:


1. The easy part: mostly notational changes, keeping track of transposes, and
noting that certain scalars become matrices and might not commute.

2. The difficult part: multioutput models have a much richer internal structure,
which has the consequence that their parametrization is nontrivial. See
Appendix 4A. (Multiple-input, single-output, MISO, models do not expose these
problems.)


Let us collect the $p$ components of the output signal into a $p$-dimensional column
vector $y(t)$ and similarly construct an $m$-dimensional input vector $u(t)$. Let the
disturbance $e(t)$ also be a $p$-dimensional column vector. The basic system description
then looks just like (2.20):

$$y(t) = G(q)u(t) + H(q)e(t) \tag{2.90}$$

where now $G(q)$ is a transfer function matrix of dimension $p \times m$ and $H(q)$ has
dimension $p \times p$. This means that the $i, j$ entry of $G(q)$, denoted by $G_{ij}(q)$, is
the scalar transfer function from input number $j$ to output number $i$. The sequence
$\{e(t)\}$ is a sequence of independent random $p$-dimensional vectors with zero mean
values and covariance matrices $Ee(t)e^T(t) = \Lambda$.

Now, all the development in this chapter goes through with proper interpreta- 
tion of matrix dimensions. Note in particular the following: 


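• The impulse responses $g(k)$ and $h(k)$ will be $p \times m$ and $p \times p$ matrices,
respectively, with norms

$$\|g(k)\| = \Big[\sum_{i,j} |g_{ij}(k)|^2\Big]^{1/2} \tag{2.91}$$

replacing absolute values in the definitions of stability.

• The definitions of covariances become [cf. (2.59)]

$$\bar{E}s(t)s^T(t - \tau) = R_s(\tau) \tag{2.92}$$

$$\bar{E}s(t)w^T(t - \tau) = R_{sw}(\tau) \tag{2.93}$$

These are now matrices, with norms as in (2.91).

• Definitions of spectra remain unchanged, while the counterpart of Theorem
2.2 reads

$$\Phi_s(\omega) = G(e^{i\omega})\,\Phi_w(\omega)\,G^T(e^{-i\omega}) \tag{2.94}$$

This result will then also handle how cross spectra transform. If we have

$$y(t) = G(q)u(t) + H(q)w(t) = \begin{bmatrix} G(q) & H(q) \end{bmatrix} \begin{bmatrix} u(t) \\ w(t) \end{bmatrix}$$

where $u$ and $w$ are jointly quasi-stationary, scalar sequences, we have:

$$\begin{aligned}
\Phi_y(\omega) &= \begin{bmatrix} G(e^{i\omega}) & H(e^{i\omega}) \end{bmatrix} \begin{bmatrix} \Phi_u(\omega) & \Phi_{uw}(\omega) \\ \Phi_{uw}(-\omega) & \Phi_w(\omega) \end{bmatrix} \begin{bmatrix} G(e^{-i\omega}) \\ H(e^{-i\omega}) \end{bmatrix} \\
&= \left|G(e^{i\omega})\right|^2 \Phi_u(\omega) + \left|H(e^{i\omega})\right|^2 \Phi_w(\omega) \\
&\quad + G(e^{i\omega})\Phi_{uw}(\omega)H(e^{-i\omega}) + H(e^{i\omega})\Phi_{uw}(-\omega)G(e^{-i\omega}) \\
&= \left|G(e^{i\omega})\right|^2 \Phi_u(\omega) + \left|H(e^{i\omega})\right|^2 \Phi_w(\omega) + 2\,\mathrm{Re}\left(G(e^{i\omega})\Phi_{uw}(\omega)H(e^{-i\omega})\right)
\end{aligned} \tag{2.95}$$

where we used that $G(e^{i\omega})$ and $G(e^{-i\omega})$ are complex conjugates as well as
$\Phi_{uw}(\omega)$ and $\Phi_{uw}(-\omega)$. The counterpart of the corollary to Theorem 2.2 for
multivariable systems reads

$$\Phi_y(\omega) = G(e^{i\omega})\,\Phi_u(\omega)\,G^T(e^{-i\omega}) + H(e^{i\omega})\,\Lambda\,H^T(e^{-i\omega}) \tag{2.96}$$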


• The spectral factorization result now reads: Suppose that $\Phi_v(\omega)$ is a $p \times p$
matrix that is positive definite for all $\omega$ and whose entries are rational functions
of $\cos \omega$ (or $e^{i\omega}$). Then there exists a $p \times p$ monic matrix function $H(z)$ whose
entries are rational functions of $z$ (or $z^{-1}$) such that the (rational) function
$\det H(z)$ has no poles and no zeros on or outside the unit circle. (For a proof,
see Theorem 10.1 in Rozanov, 1967.)

• The formulation of Theorem 2.3 carries over without changes. (In fact, the
proof in Appendix 2B is carried out for the multivariable case.)
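
The algebra leading to (2.95) can be checked at a single frequency with ordinary complex arithmetic. In the sketch below the values of $G(e^{i\omega})$, $H(e^{i\omega})$, and the spectra are arbitrary numbers picked for the illustration (with the $2 \times 2$ spectrum matrix positive definite), and evaluation at $-\omega$ is replaced by complex conjugation, as noted in the text.

```python
import numpy as np

G, H = 0.7 - 0.2j, 1.1 + 0.4j        # G(e^{iw}), H(e^{iw}) at one frequency (illustrative)
Phi_u, Phi_w = 2.0, 0.5              # auto spectra: real and nonnegative
Phi_uw = 0.3 + 0.6j                  # cross spectrum; |Phi_uw|^2 < Phi_u * Phi_w

# Joint spectrum of [u; w]; the (2,1) entry Phi_uw(-w) is the conjugate of Phi_uw(w).
Phi = np.array([[Phi_u, Phi_uw],
                [np.conj(Phi_uw), Phi_w]])
row = np.array([G, H])

# [G H] Phi [G(e^{-iw}); H(e^{-iw})], the first line of (2.95)
Phi_y_matrix = row @ Phi @ np.conj(row)
# The last line of (2.95)
Phi_y_scalar = (abs(G) ** 2 * Phi_u + abs(H) ** 2 * Phi_w
                + 2 * np.real(G * Phi_uw * np.conj(H)))
print(Phi_y_matrix, Phi_y_scalar)
```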


2.6 SUMMARY 
We have established the representation

$$y(t) = G(q)u(t) + H(q)e(t) \tag{2.97}$$

as the basic description of a linear system subject to additive random disturbances.
Here $\{e(t)\}$ is a sequence of independent random variables with zero mean values
and variances $\lambda$ (in the multioutput case, covariance matrices $\Lambda$). Also,

$$G(q) = \sum_{k=1}^{\infty} g(k)q^{-k}$$

$$H(q) = 1 + \sum_{k=1}^{\infty} h(k)q^{-k}$$

The filter $G(q)$ is stable if

$$\sum_{k=1}^{\infty} |g(k)| < \infty$$

As the reader no doubt is aware, other particular ways of representing linear
systems, such as state-space models and difference equations, are quite common
in practice. These can, however, be viewed as particular ways of representing the
sequences $\{g(k)\}$ and $\{h(k)\}$, and they will be dealt with in some detail in Chapter 4.

We have also discussed the frequency function $G(e^{i\omega})$, bearing information
about how an input sinusoid of frequency $\omega$ is transformed by the system.
Frequency-domain concepts in terms of the frequency contents of input and output signals were
also treated. The Fourier transform of a finite-interval signal was defined as

$$U_N(\omega) = \frac{1}{\sqrt{N}} \sum_{t=1}^{N} u(t)e^{-i\omega t} \tag{2.98}$$

A signal $s(t)$ such that the limits

$$\bar{E}s(t)s(t - \tau) = R_s(\tau)$$

exist is said to be quasi-stationary. Here

$$\bar{E}f(t) = \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} Ef(t)$$

Then the spectrum of $s(t)$ is defined as

$$\Phi_s(\omega) = \sum_{\tau=-\infty}^{\infty} R_s(\tau)e^{-i\tau\omega} \tag{2.99}$$

For $y$ generated as in (2.97) with $\{u(t)\}$ and $\{e(t)\}$ independent, we then have

$$\Phi_y(\omega) = \left|G(e^{i\omega})\right|^2 \Phi_u(\omega) + \lambda \left|H(e^{i\omega})\right|^2$$

2.7 BIBLIOGRAPHY 


The material of this chapter is covered in many textbooks on systems and signals. For
a thorough elementary treatment, see Oppenheim and Willsky (1983). A discussion
oriented more toward signals as time series is given in Brillinger (1981), which also
contains several results of the same character as our Theorems 2.1 and 2.2.

A detailed discussion of the sampling procedure and connections between the
physical time-continuous system and the sampled-data description (2.6) is given in
Chapter 4 of Åström and Wittenmark (1984). Chapter 6 of that book also contains
an illuminating discussion of disturbances and how to describe them mathematically.
The idea of describing stochastic disturbances as linearly filtered white noise goes
back to Wold (1938).

Fourier techniques in the analysis and description of signals go back to the
Babylonians. See Oppenheim and Willsky (1983), Section 4.0, for a brief historical
account. The periodogram was evidently introduced by Schuster (1894) to study
periodic phenomena without having to consider relative phases. The statistical
properties of the periodogram were first studied by Slutsky (1929). See also Brillinger
(1983). Concepts of spectra are intimately connected to the harmonic analysis of
time series, as developed by Wiener (1930), Wold (1938), Kolmogorov (1941), and
others. Useful textbooks on these concepts (and their estimation) include Jenkins
and Watts (1968) and Brillinger (1981). Our definition of the Fourier transform (2.37)
with summation from 1 to $N$ and a normalization with $1/\sqrt{N}$ suits our purposes,
but is not standard. The placement of $2\pi$ in the definition of the spectrum or in
the inverse transform, as we have it in (2.65), varies in the literature. Our choice is
based on the wish to let white noise have a constant spectrum whose value equals
the variance of the noise. The particular framework chosen here to accommodate
mixtures of stochastic processes and deterministic signals is apparently novel, but
has a close relationship to the classical concepts.

The result of Theorem 2.2 is standard when applied to stationary stochastic
processes. See, for example, James, Nichols, and Phillips (1947) or Åström (1970).
The extension to quasi-stationary signals appears to be new.

Spectral factorization turned out to be a key issue in the prediction of time
series. It was formulated and solved by Wiener (1949) and Paley and Wiener (1934).
The multivariable version is treated in Youla (1961). The concept is now standard in
textbooks on stationary processes (see, e.g., Rozanov, 1967).

The topic of single realization behavior is a standard problem in probability
theory. See, for example, Ibragimov and Linnik (1971), Billingsley (1965), or Chung
(1974) for general treatments of such problems.


2.8 PROBLEMS* 


* See the Preface for an explanation of the numbering system.

2G.1 Let $s(t)$ be a $p$-dimensional signal. Show that

$$\bar{E}|s(t)|^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} \mathrm{tr}\left(\Phi_s(\omega)\right) d\omega$$

2G.2 Let $\Phi_s(\omega)$ be the (power) spectrum of a scalar signal defined as in (2.63). Show that
i. $\Phi_s(\omega)$ is real.
ii. $\Phi_s(\omega) \ge 0\ \forall\omega$.
iii. $\Phi_s(-\omega) = \Phi_s(\omega)$.

2G.3 Let $s(t) = \begin{bmatrix} y(t) \\ u(t) \end{bmatrix}$ and let its spectrum be

$$\Phi_s(\omega) = \begin{bmatrix} \Phi_y(\omega) & \Phi_{yu}(\omega) \\ \Phi_{uy}(\omega) & \Phi_u(\omega) \end{bmatrix}$$

Show that $\Phi_s(\omega)$ is a Hermitian matrix; that is,

$$\Phi_s(\omega) = \Phi_s^*(\omega)$$

where * denotes transpose and complex conjugate. What does this imply about the
relationships between the cross spectra $\Phi_{yu}(\omega)$, $\Phi_{uy}(\omega)$, and $\Phi_{yu}(-\omega)$?

2G.4 Let a continuous time system representation be given by

$$y(t) = G_c(p)u(t)$$

The input is constant over the sampling interval $T$. Show that the sampled input-output
data are related by

$$y(t) = G_T(q)u(t)$$

where

$$G_T(q) = \int_{s=-i\infty}^{i\infty} G_c(s)\,\frac{e^{sT} - 1}{s}\,\frac{1}{q - e^{sT}}\,ds$$

Hint: Use (2.5).

2E.1 A stationary stochastic process has the spectrum

$$\Phi_v(\omega) = \frac{1.25 + \cos\omega}{1.64 + 1.6\cos\omega}$$

Describe $\{v(t)\}$ as an ARMA process.

2E.2 Suppose that $\{\eta(t)\}$ and $\{\xi(t)\}$ are two mutually independent sequences of independent
random variables with

$$E\eta(t) = E\xi(t) = 0, \qquad E\eta^2(t) = \lambda_\eta, \qquad E\xi^2(t) = \lambda_\xi$$

Consider

$$w(t) = \eta(t) + \xi(t) + \gamma\xi(t - 1)$$

Determine a MA(1) process

$$v(t) = e(t) + ce(t - 1)$$

where $\{e(t)\}$ is white noise with

$$Ee(t) = 0, \qquad Ee^2(t) = \lambda_e$$

such that $\{w(t)\}$ and $\{v(t)\}$ have the same spectra; that is, find $c$ and $\lambda_e$ so that $\Phi_v(\omega) =
\Phi_w(\omega)$.

2E.3 (a) In Problem 2E.2 assume that $\{\eta(t)\}$ and $\{\xi(t)\}$ are jointly Gaussian. Show that
if $\{e(t)\}$ also is chosen as Gaussian then the joint probability distribution of the
process $\{w(t)\}$ [i.e., the joint PDFs of $w(t_1), w(t_2), \ldots, w(t_p)$ for any
collection of time instances $t_i$] coincides with that of the process $\{v(t)\}$. Then, for all
practical purposes, the processes $\{v(t)\}$ and $\{w(t)\}$ are indistinguishable.

(b) Assume now that $\eta(t) \in N(0, \lambda_\eta)$, while

$$\xi(t) = \begin{cases} 1, & \text{w.p. } \lambda_\xi/2 \\ -1, & \text{w.p. } \lambda_\xi/2 \\ 0, & \text{w.p. } 1 - \lambda_\xi \end{cases}$$

Show that, although $v$ and $w$ have the same spectra, we cannot find a distribution
for $e(t)$ so that they have the same joint PDFs. Consequently the process
$w(t)$ cannot be represented as an MA(1) process, although it has a second-order
equivalent representation of that form.


2E.4 Consider the “state-space description”

$$x(t + 1) = fx(t) + w(t)$$

$$y(t) = hx(t) + v(t)$$

where $x$, $f$, $h$, $w$, and $v$ are scalars. $\{w(t)\}$ and $\{v(t)\}$ are mutually independent
white Gaussian noises with variances $R_1$ and $R_2$, respectively. Show that $y(t)$ can be
represented as an ARMA process:

$$y(t) + a_1 y(t - 1) + \cdots + a_n y(t - n) = e(t) + c_1 e(t - 1) + \cdots + c_n e(t - n)$$

Determine $n$, $a_i$, $c_i$, and the variance of $e(t)$ in terms of $f$, $h$, $R_1$, and $R_2$. What is the
relationship between $e(t)$, $w(t)$, and $v(t)$?


2E.5 Consider the system

$$y(t) = G(q)u(t) + v(t)$$

controlled by the regulator

$$u(t) = -F_2(q)y(t) + F_1(q)r(t)$$

where $\{r(t)\}$ is a quasi-stationary reference signal with spectrum $\Phi_r(\omega)$. The
disturbance $\{v(t)\}$ has spectrum $\Phi_v(\omega)$. Assume that $\{r(t)\}$ and $\{v(t)\}$ are uncorrelated and
that the resulting closed-loop system is stable. Determine the spectra $\Phi_y(\omega)$, $\Phi_u(\omega)$,
and $\Phi_{yu}(\omega)$.


2E.6 Consider the system

$$\frac{d}{dt}y(t) + ay(t) = u(t) \tag{2.100}$$

Suppose that the input $u(t)$ is piecewise constant over the sampling interval $T$:

$$u(t) = u_k, \qquad kT \le t < (k + 1)T$$

(a) Derive a sampled-data system description for $u_k$, $y(kT)$.

(b) Assume that there is a time delay of $T$ seconds so that $u(t)$ in (2.100) is replaced
by $u(t - T)$. Derive a sampled-data system description for this case.

(c) Assume that the time delay is $1.5T$ so that $u(t)$ is replaced by $u(t - 1.5T)$. Then
give the sampled-data description.

2E.7 Consider a system given by

$$y(t) + ay(t - 1) = bu(t - 1) + e(t) + ce(t - 1)$$

where $\{u(t)\}$ and $\{e(t)\}$ are independent white noises, with variances $\mu$ and $\lambda$,
respectively. Follow the procedure suggested in Appendix 2C to multiply the system
description by $e(t)$, $e(t - 1)$, $u(t)$, $u(t - 1)$, $y(t)$, and $y(t - 1)$, respectively, and take
expectation to show that

$$R_{ye}(0) = \lambda, \qquad R_{ye}(1) = (c - a)\lambda$$

$$R_{yu}(0) = 0, \qquad R_{yu}(1) = b\mu$$

$$R_y(0) = \frac{b^2\mu + \lambda + c^2\lambda - 2ac\lambda}{1 - a^2}$$

$$R_y(1) = \frac{\lambda(c - a + a^2c - ac^2) - ab^2\mu}{1 - a^2}$$

2T.1 Consider a continuous time system (2.1):

$$y(t) = \int_{\tau=0}^{\infty} g(\tau)u(t - \tau)\,d\tau$$

Let $g_T(\ell)$ be defined by (2.5), and assume that $u(t)$ is not piecewise constant, but that

$$\left|\frac{d}{dt}u(t)\right| \le C_1$$

Let $u_k = u\left((k + \tfrac{1}{2})T\right)$ and show that

$$y(kT) = \sum_{\ell=1}^{\infty} g_T(\ell)u_{k-\ell} + r_k$$

where

$$|r_k| \le C_2 T^2$$

Give a bound for $C_2$.

2T.2 If the filters Ri(g) and R2(qg) are (strictly) stable, then show that Ri(q)R2(q) is also 
(strictly) stable {see also Problem 3D.1). 


2T.3 Let G(qg) be a rational transfer function: that is, 


b,g""! + anes + bn 


Goya -2i n 
ki q" tag? |) +--+ dn 


Show that if G{q) is stable, then it is also strictly stable. 
2T.4 Consider the time-varying system
$$x(t + 1) = F(t)x(t) + G(t)u(t)$$
$$y(t) = H(t)x(t)$$
Write
$$y(t) = \sum_{k=1}^{t} g_t(k)u(t - k)$$
[take $g_t(k) = 0$ for k > t]. Assume that
$$F(t) \to \bar{F}, \quad \text{as } t \to \infty$$
where $\bar{F}$ has all eigenvalues inside the unit circle. Show that the family of filters
$\{g_t(k),\ t = 1, 2, \ldots\}$ is uniformly stable.


2D.1 Consider $U_N(\omega)$ defined by (2.37). Show that $U_N(2\pi - \omega) = \overline{U_N(\omega)} = U_N(-\omega)$ and
rewrite (2.38) in terms of real-valued quantities only.


2D.2 Establish (2.39).


2D.3 Let {u(t)} be a stationary stochastic process with $R_u(\tau) = Eu(t)u(t - \tau)$, and let
$\Phi_u(\omega)$ be its spectrum. Assume that

$$\sum_{\tau=1}^{\infty} |\tau R_u(\tau)| < \infty$$

Let $U_N(\omega)$ be defined by (2.37). Prove that
$$E|U_N(\omega)|^2 \to \Phi_u(\omega), \quad \text{as } N \to \infty$$

This is a strengthening of Lemma 2.1 for stationary processes.

2D.4 Let G(q) be a stable system. Prove that

$$\lim_{N\to\infty} \frac{1}{N}\sum_{k=1}^{N} k|g(k)| = 0$$

Hint: Use Kronecker's lemma: Let $a_k$, $b_k$ be sequences such that $a_k$ is positive and
decreasing to zero. Then $\sum a_k b_k < \infty$ implies

$$\lim_{N\to\infty} a_N \sum_{k=1}^{N} b_k = 0$$
(see, e.g., Chung, 1974, for a proof of Kronecker's lemma).

2D.5 Let $b_N(\tau)$ be a doubly indexed sequence such that, $\forall\tau$,
$$b_N(\tau) \to b(\tau), \quad \text{as } N \to \infty$$

(but not necessarily uniformly in $\tau$). Let $a_\tau$ be an infinite sequence, and assume that
$$\sum_{\tau=-\infty}^{\infty} |a_\tau| < \infty, \qquad |b_N(\tau)| \le C \quad \forall\tau$$

Show that

$$\lim_{N\to\infty}\left[\sum_{\tau=-N}^{N} a_\tau\bigl(b_N(\tau) - b(\tau)\bigr) + \sum_{|\tau|>N} a_\tau b(\tau)\right] = 0$$

Hint: Study Appendix 2A.

APPENDIX 2A: PROOF OF THEOREM 2.2


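We carry out the proof for the multivariate case. Let w(s) = 0 for s < 0, and
consider

$$R_s^N(\tau) = \frac{1}{N}\sum_{t=1}^{N} E s(t)s^T(t - \tau)
= \frac{1}{N}\sum_{t=1}^{N}\sum_{k=0}^{t}\sum_{\ell=0}^{t-\tau} g(k)\,E w(t - k)w^T(t - \tau - \ell)\,g^T(\ell)$$  (2A.1)

With the convention that w(s) = 0 if $s \notin [0, N]$, we can write

$$R_s^N(\tau) = \sum_{k=0}^{N}\sum_{\ell=0}^{N} g(k)\left[\frac{1}{N}\sum_{t=1}^{N} E w(t - k)w^T(t - \tau - \ell)\right]g^T(\ell)$$  (2A.2)

If $w(s) \ne 0$, s < 0, then s(t) gets the negligible contribution

$$\tilde{s}(t) = \sum_{k=t}^{\infty} g(k)w(t - k)$$

Let

$$R_w^N(\tau) = \frac{1}{N}\sum_{t=1}^{N} E w(t)w^T(t - \tau)$$

We see that $R_w^N(\tau + \ell - k)$ and the inner sum in (2A.2) differ by at most
$\max(k, |\tau + \ell|)$ summands, each of which is bounded by C according to (2.58).
Thus

$$\left|R_w^N(\tau + \ell - k) - \frac{1}{N}\sum_{t=1}^{N} E w(t - k)w^T(t - \tau - \ell)\right|
\le C\,\frac{\max(k, |\tau + \ell|)}{N} \le \frac{C}{N}\bigl(k + |\tau + \ell|\bigr)$$  (2A.3)

Let us define

$$R_s(\tau) = \sum_{k=0}^{\infty}\sum_{\ell=0}^{\infty} g(k)R_w(\tau + \ell - k)g^T(\ell)$$  (2A.4)

Then

$$|R_s(\tau) - R_s^N(\tau)| \le \sum_{k=N+1}^{\infty}\sum_{\ell=N+1}^{\infty} |g(k)||g(\ell)||R_w(\tau + \ell - k)|$$
$$\quad + \sum_{k=0}^{N}\sum_{\ell=0}^{N} |g(k)||g(\ell)|\,\bigl|R_w(\tau + \ell - k) - R_w^N(\tau + \ell - k)\bigr|$$
$$\quad + \frac{C}{N}\sum_{k=0}^{N} k|g(k)| \cdot \sum_{\ell=0}^{N} |g(\ell)|$$
$$\quad + \frac{C}{N}\sum_{\ell=0}^{N} |\tau + \ell||g(\ell)| \cdot \sum_{k=0}^{N} |g(k)|$$  (2A.5)
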


The first sum tends to zero as $N \to \infty$ since $|R_w(\tau)| \le C$ and G(q) is stable. It
follows from the stability of G(q) that

$$\frac{1}{N}\sum_{k=0}^{N} k|g(k)| \to 0, \quad \text{as } N \to \infty$$  (2A.6)

(see Problem 2D.4). Hence the last two sums of (2A.5) tend to zero as $N \to \infty$.
Consider now the second sum of (2A.5). Select an arbitrary $\varepsilon > 0$, and choose
$N = N_\varepsilon$ such that

$$\sum_{k=N_\varepsilon+1}^{\infty} |g(k)| < \frac{\varepsilon}{4C \cdot C_G}, \quad \text{where } C_G = \sum_{k=0}^{\infty} |g(k)|$$  (2A.7)

This is possible since G is stable. Then select $N'_\varepsilon$ such that

$$\max_{0 \le k,\ell \le N_\varepsilon} \bigl|R_w(\tau + \ell - k) - R_w^N(\tau + \ell - k)\bigr| < \frac{\varepsilon}{4C_G^2}$$

for $N > N'_\varepsilon$. This is possible since

$$R_w^N(\tau) \to R_w(\tau), \quad \text{as } N \to \infty$$  (2A.8)

(w is quasi-stationary) and since only a finite number of $R_w(s)$'s are involved (no
uniform convergence of (2A.8) is necessary). Then, for $N > N'_\varepsilon$, we have that the
second sum of (2A.5) is bounded by

$$\sum_{k=0}^{N_\varepsilon}\sum_{\ell=0}^{N_\varepsilon} |g(k)||g(\ell)|\frac{\varepsilon}{4C_G^2}
+ \sum_{k=N_\varepsilon+1}^{\infty}\sum_{\ell=0}^{\infty} |g(k)||g(\ell)| \cdot 2C
+ \sum_{k=0}^{\infty}\sum_{\ell=N_\varepsilon+1}^{\infty} |g(k)||g(\ell)| \cdot 2C$$

which is less than $5\varepsilon/4$, according to (2A.7). Hence, also, the second sum of (2A.5)
tends to zero as $N \to \infty$, and we have proved that the limit of (2A.5) is zero and
hence that s(t) is quasi-stationary.


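The proof that $\bar{E}s(t)w^T(t - \tau)$ exists is analogous and simpler.
For $\Phi_s(\omega)$ we now find that

$$\Phi_s(\omega) = \sum_{\tau=-\infty}^{\infty}\left(\sum_{k=0}^{\infty}\sum_{\ell=0}^{\infty} g(k)R_w(\tau + \ell - k)g^T(\ell)\right)e^{-i\tau\omega}$$

$$= \sum_{\tau=-\infty}^{\infty}\sum_{k=0}^{\infty}\sum_{\ell=0}^{\infty} g(k)e^{-ik\omega}\,R_w(\tau + \ell - k)e^{-i(\tau+\ell-k)\omega}\,g^T(\ell)e^{i\ell\omega}$$

$$= [\tau + \ell - k = s]
= \sum_{k=0}^{\infty} g(k)e^{-ik\omega} \cdot \sum_{s=-\infty}^{\infty} R_w(s)e^{-is\omega} \cdot \sum_{\ell=0}^{\infty} g^T(\ell)e^{i\ell\omega}$$

$$= G(e^{i\omega})\Phi_w(\omega)G^T(e^{-i\omega})$$

Hence (2.79) is proved. The result (2.80) is analogous and simpler.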
For families of linear filters we have the following result: 


Corollary. Let $\{G_\theta(q),\ \theta \in D\}$ be a uniformly stable family of linear filters, and
let {w(t)} be a quasi-stationary sequence. Let $s_\theta(t) = G_\theta(q)w(t)$ and $R_s(\tau, \theta) =
\bar{E}s_\theta(t)s_\theta^T(t - \tau)$. Then

$$\sup_{\theta \in D}\left\|\frac{1}{N}\sum_{t=1}^{N} E s_\theta(t)s_\theta^T(t - \tau) - R_s(\tau, \theta)\right\| \to 0 \quad \text{as } N \to \infty$$

Proof. We only have to establish that the convergence in (2A.5) is uniform in
$\theta \in D$. In the first step all the g(k)-terms carry an index $\theta$: $g_\theta(k)$. Interpreting

$$g(k) = \sup_{\theta \in D} |g_\theta(k)|$$

(2A.5) will of course still hold. Since the family $G_\theta(q)$ is uniformly stable, the sum
over g(k) will be convergent, which was the only property used to prove that (2A.5)
tends to zero. This completes the proof of the corollary.


APPENDIX 2B: PROOF OF THEOREM 2.3 


In this appendix we shall show a more general variant of Theorem 2.3, which will be
of value for the convergence analysis of Chapter 8. We also treat the multivariable
case.


Theorem 2B.1. Let $\{G_\theta(q),\ \theta \in D_\theta\}$ and $\{M_\theta(q),\ \theta \in D_\theta\}$ be uniformly stable
families of filters, and assume that the deterministic signal {w(t)}, t = 1, 2, ..., is
subject to

$$|w(t)| \le C_w, \quad \forall t$$  (2B.1)
Let the signal $s_\theta(t)$ be defined, for each $\theta \in D_\theta$, by
$$s_\theta(t) = G_\theta(q)v(t) + M_\theta(q)w(t)$$  (2B.2)

where {v(t)} is subject to the conditions of Theorem 2.3 [see (2.88) and let $Ee(t)e^T(t)
= \Lambda_t$]. Then

$$\sup_{\theta \in D_\theta}\left\|\frac{1}{N}\sum_{t=1}^{N}\bigl[s_\theta(t)s_\theta^T(t) - E s_\theta(t)s_\theta^T(t)\bigr]\right\| \to 0 \quad \text{w.p. 1, as } N \to \infty$$  (2B.3)


Remark. We note that with dim s = 1, $D_\theta = \{\theta^*\}$ (only one element),
$G_{\theta^*}(q) = 1$, $M_{\theta^*}(q) = 1$, and w(t) = m(t), then (2B.3) implies (2.89). With

$$s_\theta(t) = \begin{bmatrix} s(t) \\ m(t) \\ v(t) \end{bmatrix}
= \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} v(t) + \begin{bmatrix} w(t) \\ w(t) \\ 0 \end{bmatrix}$$  (2B.4)

the different cross products in (2B.3) imply all the results (2.89).


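To prove Theorem 2B.1, we first establish two lemmas.

Lemma 2B.1. Let {v(t)} obey the conditions of Theorem 2.3 and let

$$C_H = \sum_{k=0}^{\infty} \sup_t |h_t(k)|, \qquad C_e = \sup_t E|e(t)|^4, \qquad C_w = \sup_t |w(t)|$$

Then, for all r, N, k, and $\ell$,

$$E\left\|\sum_{t=r}^{N}\bigl[v(t - k)v^T(t - \ell) - E v(t - k)v^T(t - \ell)\bigr]\right\|^2 \le 4 \cdot C_e \cdot C_H^4 \cdot (N - r)$$  (2B.5)

$$E\left\|\sum_{t=r}^{N} v(t - k)w^T(t - \ell)\right\|^2 \le 4 \cdot C_e \cdot C_H^2 \cdot C_w^2 \cdot (N - r)$$  (2B.6)


Proof of Lemma 2B.1. With no loss of generality, we may take $k = \ell = 0$. We
then have

$$\sum_{t=r}^{N}\bigl[v(t)v^T(t) - E v(t)v^T(t)\bigr] = \sum_{t=r}^{N}\sum_{k=0}^{\infty}\sum_{\ell=0}^{\infty} h_t(k)\alpha(t, k, \ell)h_t^T(\ell)$$  (2B.7)
where
$$\alpha(t, k, \ell) = e(t - k)e^T(t - \ell) - \Lambda_{t-k}\delta_{k\ell}$$  (2B.8)

For the square of the i, j entry of the matrix (2B.7), denoted S(i, j), we have

$$(S(i, j))^2 = \sum_{t=r}^{N}\sum_{s=r}^{N}\sum_{k_1=0}^{\infty}\sum_{\ell_1=0}^{\infty}\sum_{k_2=0}^{\infty}\sum_{\ell_2=0}^{\infty} \gamma(t, s, k_1, k_2, \ell_1, \ell_2)$$
with
$$\gamma(t, s, k_1, k_2, \ell_1, \ell_2) = h_t^{(i)}(k_1)\alpha(t, k_1, \ell_1)\bigl[h_t^{(j)}(\ell_1)\bigr]^T \cdot h_s^{(i)}(k_2)\alpha(s, k_2, \ell_2)\bigl[h_s^{(j)}(\ell_2)\bigr]^T$$

Superscript (i) indicates the ith row vector. Since {e(t)} is a sequence of independent
variables, the expectation of $\gamma$ is zero, unless at least some of the time indexes
involved in $\alpha(t, k_1, \ell_1)$ and $\alpha(s, k_2, \ell_2)$ coincide, that is, unless

$$t - k_1 = s - k_2 \ \text{ or } \ t - k_1 = s - \ell_2 \ \text{ or } \ t - \ell_1 = s - k_2 \ \text{ or } \ t - \ell_1 = s - \ell_2$$

For given values of t, $k_1$, $k_2$, $\ell_1$, and $\ell_2$ this may happen for at most four values of
s. For these we also have
$$E\gamma(t, s, k_1, k_2, \ell_1, \ell_2) \le C_e \cdot |h_t(k_1)| \cdot |h_s(k_2)| \cdot |h_t(\ell_1)| \cdot |h_s(\ell_2)|$$
Hence

$$E(S(i, j))^2 \le \sum_{t=r}^{N}\sum_{k_1=0}^{\infty}\sum_{\ell_1=0}^{\infty} |h_t(k_1)||h_t(\ell_1)| \cdot \sum_{k_2=0}^{\infty}\sum_{\ell_2=0}^{\infty} \sup_s|h_s(k_2)| \sup_s|h_s(\ell_2)| \cdot 4 \cdot C_e
\le 4 \cdot C_e \cdot C_H^4 \cdot (N - r)$$

which proves (2B.5) of the lemma. The proof of (2B.6) is analogous and simpler.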


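Corollary to Lemma 2B.1. Let

$$w(t) = \sum_{k=0}^{\infty} \alpha_t(k)e(t - k), \qquad v(t) = \sum_{k=0}^{\infty} \beta_t(k)e(t - k)$$

Then

$$E\left\|\sum_{t=r}^{N}\bigl[w(t)v^T(t) - E w(t)v^T(t)\bigr]\right\|^2 \le C \cdot C_w^2 \cdot C_v^2 \cdot (N - r)$$

where

$$C_w = \sum_{k=0}^{\infty} \sup_t |\alpha_t(k)|, \qquad C_v = \sum_{k=0}^{\infty} \sup_t |\beta_t(k)|$$
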

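Lemma 2B.2. Let

$$R_r^N = \sup_{\theta \in D_\theta}\left\|\sum_{t=r}^{N}\bigl[s_\theta(t)s_\theta^T(t) - E s_\theta(t)s_\theta^T(t)\bigr]\right\|$$  (2B.9)
Then
$$E(R_r^N)^2 \le C(N - r)$$  (2B.10)
Proof of Lemma 2B.2. First note the following fact: If
$$\varphi = \sum_{k=0}^{\infty} a(k)z(k)$$  (2B.11)

where {a(k)} is a sequence of deterministic matrices, such that

$$\sum_{k=0}^{\infty} |a(k)| \le C_a$$

and {z(k)} is a sequence of random vectors such that $E|z(k)|^2 \le C_z$, then

$$E|\varphi|^2 = \sum_{k=0}^{\infty}\sum_{\ell=0}^{\infty} E\bigl[a(k)z(k)z^T(\ell)a^T(\ell)\bigr]
\le \sum_{k=0}^{\infty}\sum_{\ell=0}^{\infty} |a(k)|\bigl[E|z(k)|^2\bigr]^{1/2}\bigl[E|z(\ell)|^2\bigr]^{1/2}|a(\ell)|
\le C_z\left[\sum_{k=0}^{\infty} |a(k)|\right]^2 \le C_z \cdot C_a^2$$  (2B.12)

Here the first inequality is Schwarz's inequality. We now have

$$R_\theta(N, r) = \sum_{t=r}^{N}\bigl[s_\theta(t)s_\theta^T(t) - E s_\theta(t)s_\theta^T(t)\bigr]$$
$$= \sum_{t=r}^{N}\sum_{k=0}^{\infty}\sum_{\ell=0}^{\infty} g_\theta(k)\bigl[v(t - k)v^T(t - \ell) - E v(t - k)v^T(t - \ell)\bigr]g_\theta^T(\ell)$$
$$\quad + \sum_{t=r}^{N}\sum_{k=0}^{\infty}\sum_{\ell=0}^{\infty} g_\theta(k)v(t - k)w^T(t - \ell)m_\theta^T(\ell)$$
$$\quad + \sum_{t=r}^{N}\sum_{k=0}^{\infty}\sum_{\ell=0}^{\infty} m_\theta(k)w(t - k)v^T(t - \ell)g_\theta^T(\ell)$$  (2B.13)
This gives
$$\sup_\theta \|R_\theta(N, r)\| \le \sum_{k=0}^{\infty}\sum_{\ell=0}^{\infty} \sup_\theta|g_\theta(k)| \cdot \sup_\theta|g_\theta(\ell)| \cdot \|S_r^N(k, \ell)\|
+ 2\sum_{k=0}^{\infty}\sum_{\ell=0}^{\infty} \sup_\theta|g_\theta(k)| \cdot \sup_\theta|m_\theta(\ell)| \cdot \|\tilde{S}_r^N(k, \ell)\|$$  (2B.14)

with $S_r^N$ and $\tilde{S}_r^N$ defined by
$$S_r^N(k, \ell) = \sum_{t=r}^{N}\bigl[v(t - k)v^T(t - \ell) - E v(t - k)v^T(t - \ell)\bigr]$$
$$\tilde{S}_r^N(k, \ell) = \sum_{t=r}^{N} v(t - k)w^T(t - \ell)$$

Since $G_\theta(q)$ and $M_\theta(q)$ are uniformly stable families of filters,

$$\sup_\theta |g_\theta(k)| \le g(k), \qquad \sum_{k=1}^{\infty} g(k) = C_G < \infty$$
$$\sup_\theta |m_\theta(k)| \le m(k), \qquad \sum_{k=1}^{\infty} m(k) = C_M < \infty$$

Applying (2B.11) and (2B.12) together with Lemma 2B.1 to (2B.14) gives

$$E\Bigl[\sup_\theta \|R_\theta(N, r)\|\Bigr]^2 \le 2 \cdot C_G^4 \cdot 4 \cdot C_e \cdot C_H^4 \cdot (N - r)
+ 8 \cdot C_G^2 \cdot C_M^2 \cdot 4 \cdot C_e \cdot C_H^2 \cdot C_w^2 \cdot (N - r) \le C \cdot (N - r)$$

which proves Lemma 2B.2.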


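We now turn to the proof of Theorem 2B.1. Denote
$$r(t, \theta) = s_\theta(t)s_\theta^T(t) - E s_\theta(t)s_\theta^T(t)$$  (2B.15)

and let
$$R_r^N = \sup_{\theta \in D_\theta} \|R_\theta(N, r)\|$$  (2B.16)

with $R_\theta(N, r)$ defined by (2B.13). According to Lemma 2B.2 and Chebyshev's inequality,

$$P\left(\frac{1}{N^2}R_1^{N^2} > \varepsilon\right) \le \frac{1}{\varepsilon^2}E\left(\frac{R_1^{N^2}}{N^2}\right)^2 \le \frac{C}{N^2\varepsilon^2}$$

Hence

$$\sum_{k=1}^{\infty} P\left(\frac{1}{k^2}R_1^{k^2} > \varepsilon\right) < \infty$$

which, via the Borel-Cantelli lemma [see (1.18)], implies that

$$\frac{1}{k^2}R_1^{k^2} \to 0, \quad \text{w.p. 1 as } k \to \infty$$  (2B.17)
Now suppose that
$$\sup_{N^2 \le k \le (N+1)^2} \frac{1}{k}R_1^k$$
is obtained for $k = k_N$ and $\theta = \theta_N$. Hence

$$\sup_{N^2 \le k \le (N+1)^2} \frac{1}{k}R_1^k = \frac{1}{k_N}\left\|\sum_{t=1}^{k_N} r(t, \theta_N)\right\|
\le \frac{1}{k_N}\left\|\sum_{t=1}^{N^2} r(t, \theta_N)\right\| + \frac{1}{k_N}\left\|\sum_{t=N^2+1}^{k_N} r(t, \theta_N)\right\|$$  (2B.18)

Since $k_N \ge N^2$, the first term on the right side of (2B.18) tends to zero w.p. 1 in view
of (2B.17). For the second one we have, using Lemma 2B.2,

$$E\left(\frac{1}{k_N}R_{N^2+1}^{k_N}\right)^2 \le \frac{1}{N^4}\,E\max_{N^2 < k \le (N+1)^2}\bigl(R_{N^2+1}^k\bigr)^2
\le \frac{1}{N^4}\sum_{k=N^2+1}^{(N+1)^2} E\bigl(R_{N^2+1}^k\bigr)^2 \le \frac{C}{N^2}$$

which, using Chebyshev's inequality (1.19) and the Borel-Cantelli lemma as before,
shows that also the second term of (2B.18) tends to zero w.p. 1. Hence

$$\sup_{N^2 \le k \le (N+1)^2} \frac{1}{k}R_1^k \to 0, \quad \text{w.p. 1, as } N \to \infty$$  (2B.19)

which proves the theorem.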
Corollary to Theorem 2B.1. Suppose that the conditions of the theorem hold, but
that (2.88) is weakened to

$$E[e(t) \mid e(t - 1), \ldots, e(0)] = 0, \qquad E[e^2(t) \mid e(t - 1), \ldots, e(0)] = \lambda_t$$

$$E|e(t)|^4 \le C$$

Then the theorem still holds. [That is: {e(t)} need not be white noise; it is sufficient
that it is a martingale difference.]


Proof. Independence was used only in Lemma 2B.1. It is easy to see that this lemma
holds also under the weaker conditions.

APPENDIX 2C: COVARIANCE FORMULAS


For several calculations we need expressions for variances and covariances of signals
in ARMA descriptions. These are basically given by the inverse formulas

$$R_s(\tau) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \Phi_s(\omega)e^{i\tau\omega}\,d\omega$$  (2C.1a)

$$R_{sw}(\tau) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \Phi_{sw}(\omega)e^{i\tau\omega}\,d\omega$$  (2C.1b)

With the expressions according to Theorem 2.2 for the spectra, (2C.1) takes the form

$$R_s(\tau) = \frac{\lambda}{2\pi}\int_{-\pi}^{\pi}\left|\frac{C(e^{i\omega})}{A(e^{i\omega})}\right|^2 e^{i\tau\omega}\,d\omega
= [z = e^{i\omega}] = \frac{\lambda}{2\pi i}\oint \frac{C(z)C(1/z)}{A(z)A(1/z)}\,z^{\tau-1}\,dz$$  (2C.2)

for an ARMA process. The last integral is a complex integral around the unit circle,
which could be evaluated using residue calculus. Åström, Jury, and Agniel (1970) (see
also Åström, 1970, Ch. 5) have derived an efficient algorithm for computing (2C.2)
for $\tau = 0$. It has the following form:

$$A(z) = a_0 z^n + a_1 z^{n-1} + \cdots + a_n, \qquad C(z) = c_0 z^n + c_1 z^{n-1} + \cdots + c_n$$

Let $a_i^n = a_i$ and $c_i^n = c_i$ and define $a_i^k$, $c_i^k$ recursively by

$$a_i^{n-k} = \frac{a_0^{n-k+1}a_i^{n-k+1} - a_{n-k+1}^{n-k+1}a_{n-k+1-i}^{n-k+1}}{a_0^{n-k+1}}$$

$$c_i^{n-k} = \frac{a_0^{n-k+1}c_i^{n-k+1} - c_{n-k+1}^{n-k+1}a_{n-k+1-i}^{n-k+1}}{a_0^{n-k+1}}$$

$$i = 0, 1, \ldots, n - k, \qquad k = 1, 2, \ldots, n$$
Then for (2C.2)
$$R_s(0) = \frac{1}{a_0}\sum_{k=0}^{n}\frac{(c_k^k)^2}{a_0^k}$$  (2C.3)
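
The recursion leading to (2C.3) is short enough to transcribe directly. The following is a minimal sketch, assuming unit innovation variance and a stable A(z); the function name `arma_variance` and the list-based coefficient layout are illustrative choices, not the cited authors' implementation.

```python
def arma_variance(a, c):
    """Variance R_s(0) from (2C.3), assuming unit innovation variance.

    a = [a_0, ..., a_n] and c = [c_0, ..., c_n] are the coefficients of
    A(z) and C(z) as written above; A(z) is assumed stable.
    """
    n = len(a) - 1
    a_cur = list(a)  # holds a_i^m for the current superscript m
    c_cur = list(c)  # holds c_i^m for the current superscript m
    total = 0.0
    for m in range(n, -1, -1):
        # accumulate the term (c_m^m)^2 / a_0^m
        total += c_cur[m] ** 2 / a_cur[0]
        if m == 0:
            break
        a0, am, cm = a_cur[0], a_cur[m], c_cur[m]
        # step the superscript down from m to m - 1
        a_next = [(a0 * a_cur[i] - am * a_cur[m - i]) / a0 for i in range(m)]
        c_next = [(a0 * c_cur[i] - cm * a_cur[m - i]) / a0 for i in range(m)]
        a_cur, c_cur = a_next, c_next
    return total / a[0]
```

For a first-order process with a = [1, a_1] and c = [1, c_1] this reduces to the familiar $(1 + c_1^2 - 2a_1c_1)/(1 - a_1^2)$.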


An explicit expression for the variance of a second-order ARMA process
$$y(t) + a_1 y(t - 1) + a_2 y(t - 2) = e(t) + c_1 e(t - 1) + c_2 e(t - 2), \qquad Ee^2(t) = 1$$  (2C.4)
is

$$\operatorname{Var} y(t) = \frac{(1 + a_2)\bigl(1 + (c_1)^2 + (c_2)^2\bigr) - 2a_1c_1(1 + c_2) - 2c_2\bigl(a_2 - (a_1)^2 + (a_2)^2\bigr)}{(1 - a_2)(1 - a_1 + a_2)(1 + a_1 + a_2)}$$  (2C.5)

To find $R_y(\tau)$ and the cross covariances $R_{ye}(\tau)$ by hand calculations in simple examples,
the easiest approach is to multiply (2C.4) by e(t), e(t - 1), e(t - 2), y(t),
y(t - 1), and y(t - 2) and take expectation. This gives six equations for the six
variables $R_{ye}(\tau)$, $R_y(\tau)$, $\tau = 0, 1, 2$. Note that $R_{ye}(\tau) = 0$ for $\tau < 0$.
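
The six equations obtained by multiplying (2C.4) by e(t), e(t - 1), e(t - 2), y(t), y(t - 1), and y(t - 2) can also be solved mechanically. The sketch below is an illustration under the assumption $Ee^2(t) = 1$ and with $R_{ye}(\tau) = Ey(t)e(t - \tau)$; the helper names and the small elimination routine are hypothetical and only serve to make the example self-contained.

```python
def solve3(m, rhs):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    aug = [row[:] + [v] for row, v in zip(m, rhs)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(3):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][3] / aug[i][i] for i in range(3)]


def arma2_covariances(a1, a2, c1, c2):
    """Return ([R_ye(0..2)], [R_y(0..2)]) for the process (2C.4)."""
    # Multiplying by e(t), e(t-1), e(t-2) gives the cross covariances.
    r0 = 1.0
    r1 = c1 - a1 * r0
    r2 = c2 - a1 * r1 - a2 * r0
    # Multiplying by y(t), y(t-1), y(t-2) gives three equations in R_y(0..2).
    m = [[1.0, a1, a2],
         [a1, 1.0 + a2, 0.0],
         [a2, a1, 1.0]]
    rhs = [r0 + c1 * r1 + c2 * r2, c1 * r0 + c2 * r1, c2 * r0]
    return [r0, r1, r2], solve3(m, rhs)


def var_2c5(a1, a2, c1, c2):
    """The explicit variance expression (2C.5)."""
    num = ((1 + a2) * (1 + c1 ** 2 + c2 ** 2) - 2 * a1 * c1 * (1 + c2)
           - 2 * c2 * (a2 - a1 ** 2 + a2 ** 2))
    return num / ((1 - a2) * (1 - a1 + a2) * (1 + a1 + a2))
```

The first entry of the solved vector is the variance, so it should coincide with the value given by (2C.5).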

For numerical computation in MATLAB it is easiest to represent the ARMA
process in state-space form, with

$$\begin{bmatrix} y(t - 1) & \cdots & y(t - n) & e(t - 1) & \cdots & e(t - n) \end{bmatrix}^T$$

as states, and then use dlyap to compute the state covariance matrix. This will
contain all variances and covariances of interest.

SIMULATION AND PREDICTION


The system descriptions given in Chapter 2 can be used for a variety of design problems
related to the true system. In this chapter we shall discuss some such uses. The
purpose of this is twofold. First, the idea of how to predict future output values will 
turn out to be most essential for the development of identification methods. The 
expressions provided in Section 3.2 will therefore be instrumental for the further 
discussion in this book. Second, by illustrating different uses of system descriptions, 
we will provide some insights into what is required for such descriptions to be ad- 
equate for their intended uses. A leading idea of our framework for identification 
will be that the effort spent in developing a model of a system must be related to the 
application it is going to be used for. Throughout the chapter we assume that the 
system description is given in the form (2.97): 


$$y(t) = G(q)u(t) + H(q)e(t)$$  (3.1)


3.1 SIMULATION 


The most basic use of a system description is to simulate the system's response
to various input scenarios. This simply means that an input sequence $u^*(t)$, $t =
1, 2, \ldots, N$, chosen by the user is applied to (3.1) to compute the undisturbed output

$$y^*(t) = G(q)u^*(t), \quad t = 1, 2, \ldots, N$$  (3.2)

This is the output that the system would produce had there been no disturbances,
according to the description (3.1). To evaluate the disturbance influence, a random-number
generator (in the computer) is used to produce a sequence of numbers $e^*(t)$,
$t = 1, 2, \ldots, N$, that can be considered as a realization of a white-noise stochastic
process with variance $\lambda$. Then the disturbance is calculated as
$$v^*(t) = H(q)e^*(t)$$  (3.3)

By suitably presenting $y^*(t)$ and $v^*(t)$ to the user, an idea of the system's response
to $\{u^*(t)\}$ can be formed.
This way of experimenting on the model (3.1) rather than on the actual, physical
process to evaluate its behavior under various conditions has become widely used
in engineering practice of all fields and no doubt reflects the most common use
of mathematical descriptions. To be sure, models used in, say, flight simulators or
nuclear power station training simulators are of course far more complex than (3.1),
but they still follow the same general idea (see also Chapter 5).
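
The simulation steps (3.2) and (3.3) can be sketched in a few lines. The example below is a minimal illustration, assuming that G and H are rational filters given by numerator and denominator coefficients in powers of $q^{-1}$ and that all signals are zero before the first sample; the names `apply_filter` and `simulate` are hypothetical.

```python
import random


def apply_filter(num, den, x):
    """Apply num(q^-1)/den(q^-1) to x with zero initial conditions (den[0] != 0)."""
    y = []
    for t in range(len(x)):
        acc = sum(num[k] * x[t - k] for k in range(len(num)) if t - k >= 0)
        acc -= sum(den[k] * y[t - k] for k in range(1, len(den)) if t - k >= 0)
        y.append(acc / den[0])
    return y


def simulate(g_num, g_den, h_num, h_den, u, lam, seed=0):
    """Return the undisturbed output y* of (3.2) and a disturbance v* of (3.3)."""
    rng = random.Random(seed)
    # realization of Gaussian white noise with variance lam
    e = [rng.gauss(0.0, lam ** 0.5) for _ in u]
    y_star = apply_filter(g_num, g_den, u)
    v_star = apply_filter(h_num, h_den, e)
    return y_star, v_star


# Step response of G(q) = 0.5 q^-1 / (1 - 0.8 q^-1) with H(q) = 1 + 0.3 q^-1.
y_star, v_star = simulate([0.0, 0.5], [1.0, -0.8], [1.0, 0.3], [1.0], [1.0] * 20, lam=0.1)
```

Adding `y_star` and `v_star` sample by sample gives one simulated realization of the output y(t) in (3.1).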


3.2 PREDICTION 


We shall start by discussing how future values of v(t) can be predicted in case it is
described by

$$v(t) = H(q)e(t) = \sum_{k=0}^{\infty} h(k)e(t - k)$$  (3.4)

For (3.4) to be meaningful, we assume that H is stable; that is,

$$\sum_{k=0}^{\infty} |h(k)| < \infty$$  (3.5)
Invertibility of the Noise Model 


A crucial property of (3.4), which we will impose, is that it should be invertible; that
is, if v(s), $s \le t$, are known, then we shall be able to compute e(t) as

$$e(t) = \tilde{H}(q)v(t) = \sum_{k=0}^{\infty} \tilde{h}(k)v(t - k)$$  (3.6)

with
$$\sum_{k=0}^{\infty} |\tilde{h}(k)| < \infty$$

How can we determine the filter $\tilde{H}(q)$ from H(q)? The following lemma gives the
answer.

Lemma 3.1. Consider {v(t)} defined by (3.4) and assume that the filter H is stable.
Let

$$H(z) = \sum_{k=0}^{\infty} h(k)z^{-k}$$  (3.7)

and assume that the function 1/H(z) is analytic in $|z| \ge 1$:

$$\frac{1}{H(z)} = \sum_{k=0}^{\infty} \tilde{h}(k)z^{-k}$$  (3.8)

Define the filter $H^{-1}(q)$ by
$$H^{-1}(q) = \sum_{k=0}^{\infty} \tilde{h}(k)q^{-k}$$  (3.9)
Then $\tilde{H}(q) = H^{-1}(q)$ satisfies (3.6).


Remark. That (3.8) exists for $|z| \ge 1$ also means that the filter $H^{-1}(q)$ is
stable. For convenience, we shall then say that H(q) is an inversely stable filter.


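Proof. From (3.7) and (3.8) it follows that

$$1 = \sum_{k=0}^{\infty}\sum_{s=0}^{\infty} h(k)\tilde{h}(s)z^{-(k+s)} = [k + s = \ell]
= \sum_{\ell=0}^{\infty}\sum_{k=0}^{\ell} h(k)\tilde{h}(\ell - k)z^{-\ell}$$

which implies that

$$\sum_{k=0}^{\ell} h(k)\tilde{h}(\ell - k) = \begin{cases} 1, & \text{if } \ell = 0 \\ 0, & \text{if } \ell \ne 0 \end{cases}$$  (3.10)

Now let {v(t)} be defined by (3.4) and consider

$$\sum_{k=0}^{\infty} \tilde{h}(k)v(t - k) = \sum_{k=0}^{\infty}\sum_{s=0}^{\infty} \tilde{h}(k)h(s)e(t - k - s) = [k + s = \ell]
= \sum_{\ell=0}^{\infty}\left[\sum_{k=0}^{\ell} \tilde{h}(k)h(\ell - k)\right]e(t - \ell) = e(t)$$

according to (3.10), which proves the lemma.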


Note: The lemma shows that the properties of the filter H(q) are quite analogous
to those of the function H(z). It is not a triviality that the inverse filter $H^{-1}(q)$
can be derived by inverting the function H(z); hence the formulation of the result as
a lemma. However, all similar relationships between H(q) and H(z) will also hold,
and from a practical point of view it will be useful to switch freely between the filter
and its z-transform. See also Problem 3D.1.

The lemma shows that the inverse filter (3.6) in a natural way relates to the
original filter (3.4). In view of its definition, we shall also write

$$H^{-1}(q) = \frac{1}{H(q)}$$  (3.11)

for this filter. All that is needed is that the function 1/H(z) be analytic in $|z| \ge 1$; that
is, it has no poles on or outside the unit circle. We could also phrase the condition as
H(z) must have no zeros on or outside the unit circle. This ties in very nicely with the
spectral factorization result (see Section 2.3) according to which, for rational strictly
positive spectra, we can always find a representation H(q) with these properties.


Example 3.1 A Moving Average Process 


Suppose that

$$v(t) = e(t) + c\,e(t - 1)$$  (3.12)
That is,
$$H(q) = 1 + cq^{-1}$$
According to (2.87), this process is a moving average of order 1, MA(1). Then
$$H(z) = 1 + cz^{-1} = \frac{z + c}{z}$$
has a pole in z = 0 and a zero in z = -c, which is inside the unit circle if |c| < 1. If
so, the inverse filter is determined as

$$H^{-1}(z) = \frac{1}{1 + cz^{-1}} = \sum_{k=0}^{\infty} (-c)^k z^{-k}$$

and e(t) is recovered from (3.12) as

$$e(t) = \sum_{k=0}^{\infty} (-c)^k v(t - k)$$
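
The convolution identity (3.10) also gives a direct numerical recipe for the inverse filter coefficients, which can be checked against this example. The sketch below is an illustration only; the function name `inverse_impulse_response` is hypothetical, and h is assumed to be given as a finite list with a nonzero first element.

```python
def inverse_impulse_response(h, n):
    """First n coefficients of 1/H(z), obtained by solving (3.10) term by term.

    h = [h(0), h(1), ...] with h(0) != 0; coefficients beyond len(h) are zero.
    """
    h_inv = [1.0 / h[0]]
    for l in range(1, n):
        # (3.10) for l >= 1: h(0)*h_inv(l) = -sum_{k=1}^{l} h(k)*h_inv(l-k)
        s = sum(h[k] * h_inv[l - k] for k in range(1, min(l, len(h) - 1) + 1))
        h_inv.append(-s / h[0])
    return h_inv


# MA(1) with c = 0.5: the inverse coefficients should be (-c)^k.
coeffs = inverse_impulse_response([1.0, 0.5], 6)
```

The recursion itself works for any h(0) different from zero, but the resulting sequence is summable only when H is inversely stable, as required in Lemma 3.1.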
One-step-ahead Prediction of v 


Suppose now that we have observed v(s) for $s \le t - 1$ and that we want to predict
the value of v(t) based on these observations. We have, since H is monic,

$$v(t) = \sum_{k=0}^{\infty} h(k)e(t - k) = e(t) + \sum_{k=1}^{\infty} h(k)e(t - k)$$  (3.13)

Now, the knowledge of v(s), $s \le t - 1$, implies the knowledge of e(s), $s \le t - 1$,
in view of (3.6). The second term of (3.13) is therefore known at time t - 1. Let us
denote it, provisionally, by m(t - 1):

$$m(t - 1) = \sum_{k=1}^{\infty} h(k)e(t - k)$$

Suppose that {e(t)} are identically distributed, and let the probability density function of
e(t) be denoted by $f_e(x)$:

$$P(x \le e(t) \le x + \Delta x) \approx f_e(x)\Delta x$$
This distribution is independent of the other values of e(s), $s \ne t$, since {e(t)} is a
sequence of independent random variables. What we can say about v(t) at time t - 1
is consequently that the probability that v(t) assumes a value between m(t - 1) + x
and $m(t - 1) + x + \Delta x$ is $f_e(x)\Delta x$. This could also be phrased as

the (posterior) probability density function of v(t), given observations up
to time t - 1, is $f_v(x) = f_e(x - m(t - 1))$.

Formally, these calculations can be written as

$$f_v(x)\Delta x \approx P\bigl(x \le v(t) \le x + \Delta x \mid v_{-\infty}^{t-1}\bigr)$$
$$= P\bigl(x \le m(t - 1) + e(t) \le x + \Delta x\bigr)$$
$$= P\bigl(x - m(t - 1) \le e(t) \le x + \Delta x - m(t - 1)\bigr)$$
$$\approx f_e(x - m(t - 1))\Delta x$$

Here $P(A \mid v_{-\infty}^{t-1})$ means the conditional probability of the event A, given $v_{-\infty}^{t-1}$.


This is the most complete statement that can be made about v(t) at time t - 1.
Often we just give one value that characterizes this probability distribution and hence
serves as a prediction of v(t). This could be chosen as the value for which the PDF
$f_e(x - m(t - 1))$ has its maximum, the most probable value of v(t), which also is
called the maximum a posteriori (MAP) prediction. We shall, however, mostly work
with the mean value of the distribution in question, the conditional expectation of
v(t), denoted by $\hat{v}(t|t - 1)$. Since the variable e(t) has zero mean, we have

$$\hat{v}(t|t - 1) = m(t - 1) = \sum_{k=1}^{\infty} h(k)e(t - k)$$  (3.14)

It is easy to establish that the conditional expectation also minimizes the mean-square
error of the prediction error:

$$\min_{\hat{v}(t)} E\bigl(v(t) - \hat{v}(t)\bigr)^2 \ \Rightarrow\ \hat{v}(t) = \hat{v}(t|t - 1)$$

where the minimization is carried out over all functions $\hat{v}(t)$ of $v_{-\infty}^{t-1}$. See Problem
3D.3.

Let us find a more convenient expression for (3.14). We have, using (3.6) and
(3.11),

$$\hat{v}(t|t - 1) = \left[\sum_{k=1}^{\infty} h(k)q^{-k}\right]e(t) = [H(q) - 1]e(t)
= \frac{H(q) - 1}{H(q)}v(t) = [1 - H^{-1}(q)]v(t) = \sum_{k=1}^{\infty} -\tilde{h}(k)v(t - k)$$  (3.15)

Applying H(q) to both sides gives the alternative expression
$$H(q)\hat{v}(t|t - 1) = [H(q) - 1]v(t) = \sum_{k=1}^{\infty} h(k)v(t - k)$$  (3.16)


Example 3.2 A Moving Average Process 
Consider the process (3.12). Then (3.16) shows that the predictor is calculated as
$$\hat{v}(t|t - 1) + c\,\hat{v}(t - 1|t - 2) = c\,v(t - 1)$$  (3.17)

Alternatively we can determine $H^{-1}(q)$ from Example 3.1 and use (3.15):

$$\hat{v}(t|t - 1) = -\sum_{k=1}^{\infty} (-c)^k v(t - k)$$

Example 3.3 An Autoregressive Process
Consider a process
$$v(t) = \sum_{k=0}^{\infty} a^k e(t - k), \qquad |a| < 1$$
Then
$$H(z) = \sum_{k=0}^{\infty} a^k z^{-k} = \frac{1}{1 - az^{-1}}$$
which gives
$$H^{-1}(z) = 1 - az^{-1}$$
and the predictor, according to (3.15),
$$\hat{v}(t|t - 1) = a\,v(t - 1)$$  (3.18)
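
The two predictor forms of Example 3.2 can be compared on simulated data. The sketch below is a hedged illustration: it assumes that e(s) and v(s) are zero before the first sample, so that the infinite sum truncates exactly and both forms coincide; the function name and the chosen values of c and the seed are arbitrary.

```python
import random


def ma1_predictors(v, c):
    """One-step-ahead predictions of an MA(1) process computed in two ways."""
    n = len(v)
    recursive = [0.0] * n
    series = [0.0] * n
    for t in range(1, n):
        # recursion (3.17): vhat(t) = c*v(t-1) - c*vhat(t-1)
        recursive[t] = c * v[t - 1] - c * recursive[t - 1]
        # expansion from (3.15): vhat(t) = -sum_{k>=1} (-c)^k v(t-k)
        series[t] = -sum((-c) ** k * v[t - k] for k in range(1, t + 1))
    return recursive, series


rng = random.Random(1)
c = 0.6
e = [rng.gauss(0.0, 1.0) for _ in range(50)]
v = [e[t] + (c * e[t - 1] if t > 0 else 0.0) for t in range(50)]
recursive, series = ma1_predictors(v, c)
# prediction errors; under the stated assumptions these reproduce e(t)
errors = [v[t] - recursive[t] for t in range(50)]
```

The recursive form needs only one multiplication per sample, whereas the expanded form makes the dependence on all past data explicit.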


One-step-ahead Prediction of y 


Consider the description (3.1), and assume that y(s) and u(s) are known for $s \le t - 1$.
Since

$$v(s) = y(s) - G(q)u(s)$$  (3.19)
this means that also v(s) are known for $s \le t - 1$. We would like to predict the
value

$$y(t) = G(q)u(t) + v(t)$$
based on this information. Clearly, the conditional expectation of y(t), given the
information in question, is

$$\hat{y}(t|t - 1) = G(q)u(t) + \hat{v}(t|t - 1)$$
$$= G(q)u(t) + [1 - H^{-1}(q)]v(t)$$
$$= G(q)u(t) + [1 - H^{-1}(q)][y(t) - G(q)u(t)]$$

using (3.15) and (3.19), respectively. Collecting the terms gives

$$\hat{y}(t|t - 1) = H^{-1}(q)G(q)u(t) + [1 - H^{-1}(q)]y(t)$$  (3.20)

or

$$H(q)\hat{y}(t|t - 1) = G(q)u(t) + [H(q) - 1]y(t)$$  (3.21)

Remember that these expressions are shorthand notation for expansions. For example,
let $\{\ell(k)\}$ be defined by

$$\frac{G(z)}{H(z)} = \sum_{k=1}^{\infty} \ell(k)z^{-k}$$  (3.22)

[This expansion exists for $|z| \ge 1$ if H(z) has no zeros and G(z) no poles in $|z| \ge 1$.]
Then (3.20) means that

$$\hat{y}(t|t - 1) = \sum_{k=1}^{\infty} \ell(k)u(t - k) + \sum_{k=1}^{\infty} -\tilde{h}(k)y(t - k)$$  (3.23)

Unknown Initial Conditions 


In the reasoning so far we have made use of the assumption that the whole data
record from time minus infinity to t - 1 is available. Indeed, in the expression (3.20)
as in (3.23) all these data appear explicitly. In practice, however, it is usually the case
that only data over the interval [0, t - 1] are known. The simplest thing would then
be to replace the unknown data by zero (say) in (3.23):

$$\hat{y}(t|t - 1) \approx \sum_{k=1}^{t} \ell(k)u(t - k) + \sum_{k=1}^{t} -\tilde{h}(k)y(t - k)$$  (3.24)
One should realize that this is now only an approximation of the actual conditional
expectation of y(t), given data over [0, t - 1]. The exact prediction involves
time-varying filter coefficients and can be computed using the Kalman filter [see
(4.94)]. For most practical purposes, (3.24) will, however, give a satisfactory solution.
The reason is that the coefficients $\{\ell(k)\}$ and $\{\tilde{h}(k)\}$ typically decay exponentially
with k (see Problem 3G.1).


The Prediction Error 


From (3.20) and (3.1), we find that the prediction error $y(t) - \hat{y}(t|t - 1)$ is given by

$$y(t) - \hat{y}(t|t - 1) = -H^{-1}(q)G(q)u(t) + H^{-1}(q)y(t) = e(t)$$  (3.25)

The variable e(t) thus represents that part of the output y(t) that cannot be predicted
from past data. For this reason it is also called the innovation at time t.
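
The identity (3.25) is easy to confirm on a simulated first-order example. The sketch below is an illustration under stated assumptions: the model y(t) + a y(t-1) = b u(t-1) + e(t) + c e(t-1) is chosen for the purpose (so that G = bq^{-1}/(1 + aq^{-1}) and H = (1 + cq^{-1})/(1 + aq^{-1})), all signals are zero before t = 0, and the function names are hypothetical. Multiplying (3.21) by 1 + aq^{-1} gives the predictor recursion used in the code.

```python
import random


def simulate_first_order(a, b, c, u, e):
    """Simulate y(t) + a*y(t-1) = b*u(t-1) + e(t) + c*e(t-1) from rest."""
    y = []
    for t in range(len(u)):
        y_prev = y[t - 1] if t > 0 else 0.0
        u_prev = u[t - 1] if t > 0 else 0.0
        e_prev = e[t - 1] if t > 0 else 0.0
        y.append(-a * y_prev + b * u_prev + e[t] + c * e_prev)
    return y


def one_step_predictor(a, b, c, u, y):
    """Predictor from (3.21): yhat(t) + c*yhat(t-1) = b*u(t-1) + (c - a)*y(t-1)."""
    yhat = [0.0]
    for t in range(1, len(y)):
        yhat.append(-c * yhat[t - 1] + b * u[t - 1] + (c - a) * y[t - 1])
    return yhat


rng = random.Random(2)
a, b, c = -0.7, 1.0, 0.5
u = [rng.choice([-1.0, 1.0]) for _ in range(100)]
e = [rng.gauss(0.0, 1.0) for _ in range(100)]
y = simulate_first_order(a, b, c, u, e)
yhat = one_step_predictor(a, b, c, u, y)
innovations = [y[t] - yhat[t] for t in range(100)]
```

With exact initial conditions the computed `innovations` reproduce the noise sequence `e`; with wrong initial conditions the difference decays like $(-c)^t$.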


k-step-ahead Prediction of y (*) 


Having treated the problem of one-step-ahead prediction in some detail, it is easy to
generalize to the following problem: Suppose that we have observed v(s) for $s \le t$
and that we want to predict the value v(t + k). We have

$$v(t + k) = \sum_{\ell=0}^{\infty} h(\ell)e(t + k - \ell)
= \sum_{\ell=0}^{k-1} h(\ell)e(t + k - \ell) + \sum_{\ell=k}^{\infty} h(\ell)e(t + k - \ell)$$  (3.26)
Let us define
$$\bar{H}_k(q) = \sum_{\ell=0}^{k-1} h(\ell)q^{-\ell}, \qquad \tilde{H}_k(q) = \sum_{\ell=k}^{\infty} h(\ell)q^{-\ell+k}$$  (3.27)

The second sum of (3.26) is known at time t, while the first sum is independent
of what has happened up to time t and has zero mean. The conditional mean of
v(t + k), given $v_{-\infty}^{t}$, is thus given by

$$\hat{v}(t + k|t) = \sum_{\ell=k}^{\infty} h(\ell)e(t + k - \ell) = \tilde{H}_k(q)e(t) = \tilde{H}_k(q)H^{-1}(q)v(t)$$

This expression is the k-step-ahead predictor of v.

Now suppose that we have measured $y_{-\infty}^{t}$ and know $u_{-\infty}^{t+k-1}$ and would like to
predict y(t + k). We have, as before,
$$y(t + k) = G(q)u(t + k) + v(t + k)$$

which gives

$$\hat{y}(t + k \mid y_{-\infty}^{t}, u_{-\infty}^{t+k-1}) \triangleq \hat{y}(t + k|t) = G(q)u(t + k) + \hat{v}(t + k|t)$$
$$= G(q)u(t + k) + \tilde{H}_k(q)H^{-1}(q)v(t)$$
$$= G(q)u(t + k) + \tilde{H}_k(q)H^{-1}(q)[y(t) - G(q)u(t)]$$  (3.28)

Introduce

$$W_k(q) \triangleq 1 - q^{-k}\tilde{H}_k(q)H^{-1}(q) = \bigl[H(q) - q^{-k}\tilde{H}_k(q)\bigr]H^{-1}(q) = \bar{H}_k(q)H^{-1}(q)$$  (3.29)

Then simple manipulation on (3.28) gives
$$\hat{y}(t + k|t) = W_k(q)G(q)u(t + k) + \tilde{H}_k(q)H^{-1}(q)y(t)$$  (3.30)

or, using the first equality in (3.29),

$$\hat{y}(t|t - k) = W_k(q)G(q)u(t) + [1 - W_k(q)]y(t)$$  (3.31)

This expression, together with (3.27) and (3.29), defines the k-step-ahead predictor
for y. Notice that this predictor can also be viewed as a one-step-ahead
predictor associated with the model

$$y(t) = G(q)u(t) + W_k^{-1}(q)e(t)$$  (3.32)

The prediction error is obtained from (3.30) as
$$e_k(t + k) \triangleq y(t + k) - \hat{y}(t + k|t) = -W_k(q)G(q)u(t + k) + \bigl[q^k - \tilde{H}_k(q)H^{-1}(q)\bigr]y(t)$$
$$= W_k(q)[y(t + k) - G(q)u(t + k)] = W_k(q)H(q)e(t + k)$$
$$= \bar{H}_k(q)e(t + k)$$  (3.33)

Here we used (3.29) in the second and fourth equalities. According to (3.27), $\bar{H}_k(q)$
is a polynomial in $q^{-1}$ of order k - 1. Hence the prediction error is a moving average
of e(t + k), ..., e(t + 1).
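
Since the k-step-ahead prediction error is $\bar{H}_k(q)e(t + k)$, its variance is $\lambda\sum_{\ell=0}^{k-1} h^2(\ell)$ for white e with variance $\lambda$. The sketch below illustrates this for a rational H given by numerator and denominator coefficients in $q^{-1}$; the function names are hypothetical, and the first-order autoregressive case is used only as a check.

```python
def impulse_response(num, den, n):
    """First n impulse response coefficients h(0..n-1) of num(q^-1)/den(q^-1)."""
    h = []
    for k in range(n):
        acc = num[k] if k < len(num) else 0.0
        acc -= sum(den[j] * h[k - j] for j in range(1, min(k, len(den) - 1) + 1))
        h.append(acc / den[0])
    return h


def k_step_error_variance(num, den, k, lam):
    """Variance of the k-step-ahead prediction error: lam * sum_{l<k} h(l)^2."""
    h_bar = impulse_response(num, den, k)  # coefficients of the polynomial in (3.27)
    return lam * sum(x * x for x in h_bar)


# H(q) = 1 / (1 - 0.8 q^-1): h(l) = 0.8^l, so three steps ahead gives
# 1 + 0.64 + 0.4096 = 2.0496 for lam = 1.
var3 = k_step_error_variance([1.0], [1.0, -0.8], 3, 1.0)
```

For k = 1 the variance equals $\lambda$, the innovation variance, and it grows toward the variance of v itself as k increases.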




The Multivariable Case (*) 


For a multivariable system description (3.1) [or (2.90)], we define the p x p matrix
filter $H^{-1}(q)$ as

$$H^{-1}(q) = \sum_{k=0}^{\infty} \tilde{h}(k)q^{-k}$$

Here $\tilde{h}(k)$ are the p x p matrices defined by the expansion of the matrix function
$$[H(z)]^{-1} = \sum_{k=0}^{\infty} \tilde{h}(k)z^{-k}$$  (3.34)

This expansion can be interpreted entrywise in the matrix $[H(z)]^{-1}$ (formed by
standard manipulations for matrix inversion). It exists for $|z| \ge 1$ provided the
function det H(z) has no zeros in $|z| \ge 1$. With $H^{-1}(q)$ thus defined, all calculations
and formulas given previously are valid also for the multivariable case.


3.3 OBSERVERS 


In many cases in systems and control theory, one does not work with a full description 
of the properties of disturbances as in (3.1). Instead a noise-free or “deterministic” 
model is used: 

y(t) = G(q)u(t) (3.35) 


In this case one probably keeps in the back of one's mind, though, that (3.35) is not
really the full story about the input-output properties.

The description (3.35) can of course also be used for "computing," "guessing,"
or "predicting" future values of the output. The lack of noise model, however,
leaves several possibilities for how this can best be done. The concept of observers
is a key issue for these calculations. This concept is normally discussed in terms of
state-space representations of (3.35) (see Section 4.3); see, for example, Luenberger
(1971) or Åström and Wittenmark (1984), but it can equally well be introduced for
the input-output form (3.35).


An Example 
Let
$$G(z) = b\sum_{k=1}^{\infty} a^{k-1}z^{-k} = \frac{bz^{-1}}{1 - az^{-1}}$$  (3.36)

This means that the input-output relationship can be represented either as

$$y(t) = b\sum_{k=1}^{\infty} a^{k-1}u(t - k)$$  (3.37)

that is,

$$y(t) = \frac{bq^{-1}}{1 - aq^{-1}}u(t)$$
or as
$$(1 - aq^{-1})y(t) = bq^{-1}u(t)$$

$$y(t) - a\,y(t - 1) = b\,u(t - 1)$$  (3.38)

Now, if we are given the description (3.35) and (3.36) together with data y(s), u(s),
$s \le t - 1$, and are asked to produce a "guess" or to "calculate" what y(t) might be,
we could use either

$$\hat{y}(t|t - 1) = b\sum_{k=1}^{\infty} a^{k-1}u(t - k)$$  (3.39)
or
$$\hat{y}(t|t - 1) = a\,y(t - 1) + b\,u(t - 1)$$  (3.40)

As long as the data and the system description are correct, there would also be
no difference between (3.39) and (3.40): they are both "observers" (in our setting
"predictors" would be a more appropriate term) for the system. The choice between
them would be carried out by the designer in terms of how vulnerable they are to
imperfections in data and descriptions. For example, if input-output data are lacking
prior to time s = 0, then (3.39) suffers from an error that decays like $a^t$ (effect of
wrong initial conditions), whereas (3.40) is still correct for $t \ge 1$. On the other hand,
(3.39) is unaffected by measurement errors in the output, whereas such errors are
directly transferred to the prediction in (3.40). From the discussion of Section 3.2, it
should be clear that, if (3.35) is complemented with a noise model as in (3.1), then
the choice of predictor becomes unique (cf. Problem 3E.3). This follows since the
conditional mean of the output, computed according to the assumed noise model, is
a uniquely defined quantity.
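
The different sensitivity of (3.39) and (3.40) to missing initial data can be made concrete. The sketch below is an illustration under stated assumptions: the system (3.38) is taken to be in a nonzero state y(0) = y0 produced by inputs before time 0 that the predictor cannot see, and those inputs are replaced by zero in (3.39); the function name is hypothetical.

```python
def observer_errors(a, b, y0, u):
    """Prediction errors of (3.39) and (3.40) when data before time 0 are missing.

    y0 is the true output at time 0; u = [u(0), u(1), ...] is the known input.
    """
    y = [y0]
    for t in range(1, len(u) + 1):
        y.append(a * y[t - 1] + b * u[t - 1])
    err_series, err_recursive = [], []
    for t in range(1, len(y)):
        # (3.39) with the unknown inputs u(s), s < 0, replaced by zero
        series = b * sum(a ** (k - 1) * u[t - k] for k in range(1, t + 1))
        # (3.40) uses the measured output y(t-1) directly
        recursive = a * y[t - 1] + b * u[t - 1]
        err_series.append(y[t] - series)
        err_recursive.append(y[t] - recursive)
    return err_series, err_recursive


err_series, err_recursive = observer_errors(0.7, 1.0, 2.0, [1.0, -1.0, 0.5, 2.0, 0.0])
```

Here the error of (3.39) at time t equals $a^t y(0)$, while (3.40) is exact. If noise were added to the measured outputs, the roles would reverse, since only (3.40) uses y.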


A Family of Predictors for (3.35) 


The example (3.36) showed that the choice of predictor could be seen as a trade-off 
between sensitivity with respect to output measurement errors and rapidly decaying 
effects of erroneous initial conditions. To introduce design variables for this trade-off, 
choose a filter W(q) such that 


$$W(q) = 1 + \sum_{\ell=k}^{\infty} w_\ell q^{-\ell}$$  (3.41)
Apply it to both sides of (3.35):
$$W(q)y(t) = W(q)G(q)u(t)$$
which means that

$$y(t) = [1 - W(q)]y(t) + W(q)G(q)u(t)$$
In view of (3.41), the right side of this expression depends only on y(s), $s \le t - k$,
and u(s), $s \le t - 1$. Based on that information, we could thus produce a "guess" or
prediction of y(t) as

$$\hat{y}(t|t - k) = [1 - W(q)]y(t) + W(q)G(q)u(t)$$  (3.42)
The trade-off considerations for the choice of W would then be:

1. Select W(q) so that both W and WG have rapidly decaying filter
coefficients in order to minimize the influence of erroneous initial
conditions.

2. Select W(q) so that measurement imperfections in y(t) are maximally
attenuated.                                                  (3.43)


The latter issue can be illuminated in the frequency domain: Suppose that $y(t) =
y_M(t) + v(t)$, where $y_M(t) = G(q)u(t)$ is the useful signal and v(t) is a measurement
error. Then the prediction error according to (3.42) is

$$\varepsilon(t) = y(t) - \hat{y}(t|t - k) = W(q)v(t)$$  (3.44)

The spectrum of this error is, according to Theorem 2.2,

$$\Phi_\varepsilon(\omega) = \bigl|W(e^{i\omega})\bigr|^2\Phi_v(\omega)$$  (3.45)

where $\Phi_v(\omega)$ is the spectrum of v. The problem is thus to select W, subject to (3.41),
such that the error spectrum (3.45) has an acceptable size and suitable shape.

A comparison with the k-step prediction case of Section 3.2 shows that the
expression (3.42) is identical to (3.31) with $W(q) = W_k(q)$. It is clear that the
qualification of a complete noise model in (3.1) allows us to analytically compute
the filter W in accordance with aspect 2. This was indeed what we did in Section
3.2. However, aspect 1 was neglected there, since we assumed all past data to be
available. Normally, as we pointed out, this aspect is also less important.


Fundamental Role of the Predictor Filter 

It turns out that for most uses of system descriptions it is the predictor form (3.20),
or as in (3.31) and (3.42), that is more important than the description (3.1) or (3.35)
itself. We use (3.31) and (3.42) to predict, or "guess," future outputs; we use it for
control design to regulate the predicted output, and so on. Now, (3.31) and (3.42) are
just linear filters into which sequences {u(t)} and {y(t)} are fed, and that produce
$\hat{y}(t|t - k)$ as output. The thoughts that the designer had when he or she selected
this filter are immaterial once it is put to use: The filter is the same whether $W = W_k$
was chosen as a trade-off (3.43) or computed from H as in (3.27) and (3.29). The
noise model H in (3.1) is from this point of view just an alibi for determining the
predictor. This is the viewpoint we are going to adopt. The predictor filter is the
fundamental system description (Figure 3.1). Our rationale for arriving at the filter
is secondary. This also means that the difference between a "stochastic system"
(3.1) and a "deterministic" one (3.35) is not fundamental. Nevertheless, we find it
convenient to use the description (3.1) as the basic system description. It is in a
one-to-one correspondence with the one-step-ahead predictor (3.20) (see Problem
3D.2) and relates more immediately to traditional system descriptions.

Figure 3.1 The predictor filter.


3.4 SUMMARY 
Starting from the representation

$$y(t) = G(q)u(t) + H(q)e(t)$$

we have derived an expression for the one-step-ahead prediction of y(t) [i.e., the
best “guess” of y(t) given u(s) and y(s), $s \le t-1$]. This expression is given by

$$\hat{y}(t|t-1) = H^{-1}(q)G(q)u(t) + \left[1 - H^{-1}(q)\right]y(t) \tag{3.46}$$

We also derived a corresponding k-step-ahead predictor (3.31). We pointed
out that one can arrive at such predictors also through deterministic observer
considerations, not relying on a noise model H. We have stressed that the bottom line in
most uses of a system description is how these predictions actually are computed; the
underlying noise assumptions are merely vehicles for arriving at the predictors. The
discussion of Chapters 2 and 3 can thus be viewed as a methodology for “guessing”
future system outputs.

It should be noted that calculations such as (3.46) involved in determining 
the predictors and regulators are typically performed with greater computational 
efficiency once they are applied to transfer functions G and H with more specific 
structures. This will be illustrated in the next chapter. 


3.5 BIBLIOGRAPHY 


Prediction and control are standard textbook topics. Accounts of the k-step-ahead
predictor and associated control problems can be found in Åström (1970) and Åström
and Wittenmark (1984). Prediction is treated in detail in, for example, Anderson and
Moore (1979) and Box and Jenkins (1970). An early account of this theory is Whittle
(1963).

Prediction theory was developed by Kolmogorov (1941), Wiener (1949),
Kalman (1960), and Kalman and Bucy (1961). The hard part in these problems
is indeed to find a suitable representation of the disturbance. Once we arrive at
(3.1) via spectral factorization, or at its time-varying counterpart via the Riccati
equation [see (4.95) and Problem 4G.3], the calculation of a reasonable predictor
is, as demonstrated here, easy. Note, however (as pointed out in Problem 2E.3),
that for non-Gaussian processes normally only the second-order properties can be
adequately described by (3.1), which consequently is too simple a representation to
accommodate more complex noise structures. The calculations carried out in Section
3.2 are given in Åström (1970) for the case where G and H are rational with
the same denominators. Rissanen and Barbosa (1969) have given expressions for the
prediction in input-output models of this kind when the lack of knowledge of the
infinite past is treated properly [i.e., when the ad hoc solution (3.24) is not accepted].
The result is, of course, a time-varying predictor.


3.6 PROBLEMS 


3G.1 Suppose that the transfer function G(z) is rational and that its poles are all inside
$|z| < \mu$, where $\mu < 1$. Show that

$$|g(k)| \le c \cdot \mu^k$$

where g(k) is defined as in (2.16).

3G.2 Let A(q) and B(q) be two monic stable and inversely stable filters. Show that

$$\frac{1}{2\pi}\int_{-\pi}^{\pi} \left|A(e^{i\omega})\right|^2 \left|B(e^{i\omega})\right|^2 d\omega \ge 1$$

with equality only if A(q) = 1/B(q).

3E.1 Let

$$H(q) = 1 - 1.1q^{-1} + 0.3q^{-2}$$

Compute $H^{-1}(q)$ as an explicit infinite expansion.

3E.2 Determine the 3-step-ahead predictors for

$$y(t) = \frac{1}{1 - aq^{-1}}\,e(t)$$

and

$$y(t) = (1 + cq^{-1})\,e(t)$$

respectively. What are the variances of the associated prediction errors?


3E.3 Show that if (3.35) and (3.36) are complemented with the noise model H(q) = 1 then
(3.39) is the natural predictor, whereas the noise model

$$H(q) = \sum_{k=0}^{\infty} (-a)^k q^{-k}$$

leads to the predictor (3.40).

3E.4 Let e(t) have the distribution

$$e(t) = \begin{cases} 1, & \text{w.p. } 0.5 \\ -0.5, & \text{w.p. } 0.25 \\ -1.5, & \text{w.p. } 0.25 \end{cases}$$

Let

$$v(t) = H(q)e(t)$$

and let $\hat{v}(t|t-1)$ be defined as in the text. What is the most probable value (MAP) of
v(t) given the information $\hat{v}(t|t-1)$? What is the probability that v(t) will assume a
value between $\hat{v}(t|t-1) - \tfrac{1}{4}$ and $\hat{v}(t|t-1) + \tfrac{1}{4}$?


3T.1 Suppose that A(q) is inversely stable and monic. Show that $A^{-1}(q)$ is monic.

3T.2 Suppose the measurement error spectrum of v in (3.44) and (3.45) is given by

$$\Phi_v(\omega) = \lambda\left|R(e^{i\omega})\right|^2$$

for some monic stable and inversely stable filter R(q). Find the filter W, subject to
(3.41) with k = 1, that minimizes

$$E\varepsilon^2(t)$$

Hint: Use Problem 3G.2.

3T.3 Consider the system description of Problem 2E.4:

$$x(t+1) = fx(t) + w(t)$$
$$y(t) = hx(t) + v(t)$$

(x scalar). Assume that {v(t)} is white Gaussian noise with variance $R_2$ and that {w(t)}
is a sequence of independent variables with

$$w(t) = \begin{cases} 1, & \text{w.p. } 0.05 \\ -1, & \text{w.p. } 0.05 \\ 0, & \text{w.p. } 0.9 \end{cases}$$

Determine a monic filter W(q) such that the predictor

$$\hat{y}(t) = (1 - W(q))\,y(t)$$

minimizes

$$E\left(y(t) - \hat{y}(t)\right)^2$$

What can be said about

$$E\left(y(t) \mid y_{-\infty}^{t-1}\right)?$$

3T.4 Consider the noise description

$$v(t) = e(t) + ce(t-1), \qquad |c| > 1, \qquad Ee^2(t) = \lambda \tag{3.47}$$

Show that e(t) cannot be reconstructed from $v_{-\infty}^{t}$ by a causal, stable filter. However, show
that e(t) can be computed from $v_{t+1}^{\infty}$ by an anticausal, stable filter. Thus construct a
stable, anticausal predictor for v(t) given v(s), $s \ge t+1$.

Determine a noise $\bar{v}(t)$ with the same second-order properties as v(t), such that

$$\bar{v}(t) = \bar{e}(t) + c^*\bar{e}(t-1), \qquad |c^*| < 1, \qquad E\bar{e}^2(t) = \lambda^* \tag{3.48}$$

Show that $\bar{v}(t)$ can be predicted from $\bar{v}_{-\infty}^{t-1}$ by a stable, causal predictor. [Measuring
just second-order properties of the noise, we cannot distinguish between (3.47) and
(3.48). However, when e(t) in (3.47) is a physically well defined quantity (although not
measured by us), we may be interested in which one of (3.47) and (3.48) has generated
the noise. See Benveniste, Goursat, and Ruget (1980).]

3D.1 In the chapter we have freely multiplied, added, subtracted, and divided by
transfer-function operators G(q) and H(q). Division was formalized and justified by Lemma
3.1 and (3.11). Justify similarly addition and multiplication.

3D.2 Suppose a one-step-ahead predictor is given as

$$\hat{y}(t|t-1) = L_1(q)u(t-1) + L_2(q)y(t-1)$$

Calculate the system description (3.1) from which this predictor was derived.

3D.3 Consider a stochastic process {v(t)} and let

$$\hat{v}(t) = E\left(v(t) \mid v_{-\infty}^{t-1}\right)$$

Define

$$e(t) = v(t) - \hat{v}(t)$$

Let $\bar{v}(t)$ be an arbitrary function of $v_{-\infty}^{t-1}$. Show that

$$E\left(v(t) - \bar{v}(t)\right)^2 \ge Ee^2(t)$$

Hint: Use $Ex^2 = E_z E(x^2 \mid z)$.


4 


MODELS OF LINEAR 
TIME-INVARIANT SYSTEMS 


A model of a system is a description of (some of) its properties, suitable for a certain
purpose. The model need not be a true and accurate description of the system, nor
need the user have to believe so, in order to serve its purpose.

System identification is the subject of constructing or selecting models of dy- 
namical systems to serve certain purposes. As we noted in Chapter 1, a first step is 
to determine a class of models within which the search for the most suitable model 
is to be conducted. In this chapter we shall discuss such classes of models for linear 
time-invariant systems. 


4.1 LINEAR MODELS AND SETS OF LINEAR MODELS 


A linear time-invariant model is specified, as we saw in Chapter 2, by the impulse
response $\{g(k)\}_1^\infty$, the spectrum $\Phi_v(\omega) = \lambda\left|H(e^{i\omega})\right|^2$ of the additive disturbance,
and, possibly, the probability density function (PDF) of the disturbance e(t). A
complete model is thus given by

$$\begin{aligned} y(t) &= G(q)u(t) + H(q)e(t) \\ &f_e(\cdot), \text{ the PDF of } e \end{aligned} \tag{4.1}$$

with

$$G(q) = \sum_{k=1}^{\infty} g(k)q^{-k}, \qquad H(q) = 1 + \sum_{k=1}^{\infty} h(k)q^{-k} \tag{4.2}$$


A particular model thus corresponds to specification of the three functions G,
H, and $f_e$. It is in most cases impractical to make this specification by enumerating
the infinite sequences {g(k)}, {h(k)} together with the function $f_e(x)$. Instead one
chooses to work with structures that permit the specification of G and H in terms of a
finite number of numerical values. Rational transfer functions and finite-dimensional
state-space descriptions are typical examples of this. Also, most often the PDF $f_e$ is
not specified as a function, but described in terms of a few numerical characteristics,
typically the first and second moments:

$$\begin{aligned} Ee(t) &= \int x f_e(x)\,dx = 0 \\ Ee^2(t) &= \int x^2 f_e(x)\,dx = \lambda \end{aligned} \tag{4.3}$$


It is also common to assume that e(t) is Gaussian, in which case the PDF is entirely
specified by (4.3). The specification of (4.1) in terms of a finite number of numerical
values, or coefficients, has another and most important consequence for the purposes
of system identification. Quite often it is not possible to determine these coefficients
a priori from knowledge of the physical mechanisms that govern the system’s
behavior. Instead the determination of all or some of them must be left to estimation
procedures. This means that the coefficients in question enter the model (4.1) as
parameters to be determined. We shall generally denote such parameters by the vector
$\theta$, and thus have a model description

$$\begin{aligned} y(t) &= G(q,\theta)u(t) + H(q,\theta)e(t) \\ &f_e(x,\theta), \text{ the PDF of } e(t);\ \{e(t)\} \text{ white noise} \end{aligned} \tag{4.4}$$

The parameter vector $\theta$ then ranges over a subset of $\mathbf{R}^d$, where d is the dimension
of $\theta$:

$$\theta \in D_{\mathcal{M}} \subset \mathbf{R}^d \tag{4.5}$$


Notice that (4.4) to (4.5) no longer is a model; it is a set of models, and it is for
the estimation procedure to select that member in the set that appears to be most
suitable for the purpose in question. [One may sometimes loosely talk about “the
model (4.4),” but this is abuse of notation from a formal point of view.] Using (3.20),
we can compute the one-step-ahead prediction for (4.4). Let it be denoted by $\hat{y}(t|\theta)$
to emphasize its dependence on $\theta$. We thus have

$$\hat{y}(t|\theta) = H^{-1}(q,\theta)G(q,\theta)u(t) + \left[1 - H^{-1}(q,\theta)\right]y(t) \tag{4.6}$$

This predictor form does not depend on $f_e(x,\theta)$. In fact, as we stressed in Section 3.3,
we could very well arrive at (4.6) by considerations that are not probabilistic. Then the
specification (4.4) does not apply. We shall use the term predictor models for models
that just specify G and H as in (4.4) or in the form (4.6). Similarly, probabilistic
models will signify descriptions (4.4) that give a complete characterization of the
probabilistic properties of the system. A parametrized set of models like (4.6) will be
called a model structure and will be denoted by $\mathcal{M}$. The particular model associated
with the parameter value $\theta$ will be denoted by $\mathcal{M}(\theta)$. (A formal definition is given
in Section 4.5.)

In the following three sections, different ways of describing (4.4) in terms of $\theta$
(i.e., different ways of parametrizing the model set) will be discussed. A formalization
of the concepts of model sets, parametrizations, model structures, and uniqueness of
parametrization will then be given in Section 4.5, while questions of identifiability
are discussed in Section 4.6.


4.2 A FAMILY OF TRANSFER-FUNCTION MODELS


Perhaps the most immediate way of parametrizing G and H is to represent them as 
rational functions and let the parameters be the numerator and denominator coeffi- 
cients. In this section we shall describe various ways of carrying out such parametriza- 
tions. Such model structures are also known as black-box models. 


Equation Error Model Structure 


Probably the most simple input-output relationship is obtained by describing it as a
linear difference equation:

$$\begin{aligned} y(t) &+ a_1 y(t-1) + \cdots + a_{n_a} y(t-n_a) \\ &= b_1 u(t-1) + \cdots + b_{n_b} u(t-n_b) + e(t) \end{aligned} \tag{4.7}$$

Since the white-noise term e(t) here enters as a direct error in the difference equation,
the model (4.7) is often called an equation error model (structure). The adjustable
parameters are in this case

$$\theta = [a_1 \;\; a_2 \;\ldots\; a_{n_a} \;\; b_1 \;\ldots\; b_{n_b}]^T \tag{4.8}$$

If we introduce

$$A(q) = 1 + a_1 q^{-1} + \cdots + a_{n_a} q^{-n_a}$$

and

$$B(q) = b_1 q^{-1} + \cdots + b_{n_b} q^{-n_b}$$

we see that (4.7) corresponds to (4.4) with

$$G(q,\theta) = \frac{B(q)}{A(q)}, \qquad H(q,\theta) = \frac{1}{A(q)} \tag{4.9}$$

Remark. It may seem annoying to use q as an argument of A(q), being a
polynomial in $q^{-1}$. The reason for this is, however, simply to be consistent with the
conventional definition of the z-transform; see (2.17).

We shall also call the model (4.7) an ARX model, where AR refers to the
autoregressive part A(q)y(t) and X to the extra input B(q)u(t) (called the exogenous
variable in econometrics). In the special case where $n_a = 0$, y(t) is modeled
as a finite impulse response (FIR). Such model sets are particularly common in
signal-processing applications.

The signal flow can be depicted as in Figure 4.1. From that picture we see that
the model (4.7) is perhaps not the most natural one from a physical point of view:
the white noise is assumed to go through the denominator dynamics of the system
before being added to the output. Nevertheless, the equation error model set has
a very important property that makes it a prime choice in many applications: The
predictor defines a linear regression.

Figure 4.1 The ARX model structure.


Linear Regressions 


Let us compute the predictor for (4.7). Inserting (4.9) into (4.6) gives

$$\hat{y}(t|\theta) = B(q)u(t) + [1 - A(q)]\,y(t) \tag{4.10}$$

Clearly, this expression could have more easily been derived directly from (4.7).
Let us reiterate the view expressed in Section 3.3: Without a stochastic framework,
the predictor (4.10) is a natural choice if the term e(t) in (4.7) is considered to be
“insignificant” or “difficult to guess.” It is thus perfectly natural to work with the
expression (4.10) also for “deterministic” models.

Now introduce the vector

$$\varphi(t) = [-y(t-1) \;\ldots\; -y(t-n_a) \;\; u(t-1) \;\ldots\; u(t-n_b)]^T \tag{4.11}$$

Then (4.10) can be rewritten as

$$\hat{y}(t|\theta) = \theta^T\varphi(t) = \varphi^T(t)\theta \tag{4.12}$$

This is the important property of (4.7) that we alluded to previously. The predictor
is a scalar product between a known data vector $\varphi(t)$ and the parameter vector $\theta$.
Such a model is called a linear regression in statistics, and the vector $\varphi(t)$ is known
as the regression vector. It is of importance since powerful and simple estimation
methods can be applied for the determination of $\theta$.

In case some coefficients of the polynomials A and B are known, we arrive at
linear regressions of the form

$$\hat{y}(t|\theta) = \varphi^T(t)\theta + \mu(t) \tag{4.13}$$

where $\mu(t)$ is a known term. See Problem 4E.1 and also (5.67). The estimation of $\theta$
in linear regressions will be treated in Section 7.3. See also Appendix II.
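
As an illustration of the regression vector (4.11) and the scalar-product predictor (4.12), consider the following minimal sketch. The orders, the parameter values, and the helper names `lag`, `arx_regressor`, and `arx_predict` are hypothetical and chosen only for this example; samples before time 0 are assumed to be zero.

```python
import random

def lag(x, t, k):
    """Return x(t-k), with samples before time 0 taken as zero."""
    return x[t - k] if t - k >= 0 else 0.0

def arx_regressor(y, u, t, na, nb):
    """Regression vector (4.11): [-y(t-1) ... -y(t-na)  u(t-1) ... u(t-nb)]."""
    return ([-lag(y, t, k) for k in range(1, na + 1)]
            + [lag(u, t, k) for k in range(1, nb + 1)])

def arx_predict(theta, y, u, t, na, nb):
    """One-step-ahead prediction (4.12): the scalar product theta^T phi(t)."""
    phi = arx_regressor(y, u, t, na, nb)
    return sum(th * ph for th, ph in zip(theta, phi))

random.seed(0)
na, nb = 2, 1
theta = [-1.5, 0.7, 1.0]   # hypothetical [a1, a2, b1]
N = 100
u = [random.gauss(0.0, 1.0) for _ in range(N)]
e = [random.gauss(0.0, 0.1) for _ in range(N)]

# Simulate the difference equation (4.7) from zero initial conditions:
# y(t) = theta^T phi(t) + e(t)
y = []
for t in range(N):
    y.append(arx_predict(theta, y, u, t, na, nb) + e[t])

# With the true parameters, the prediction error reproduces the white noise e(t)
max_err = max(abs(y[t] - arx_predict(theta, y, u, t, na, nb) - e[t])
              for t in range(N))
print(f"max |y - yhat - e| = {max_err:.2e}")
```

Because the prediction is linear in theta for a data vector that does not depend on theta, estimating theta from data reduces to an ordinary least-squares problem, which is the property the text emphasizes.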


ARMAX Model Structure 


The basic disadvantage with the simple model (4.7) is the lack of adequate freedom
in describing the properties of the disturbance term. We could add flexibility to that
by describing the equation error as a moving average of white noise. This gives the
model

$$\begin{aligned} y(t) &+ a_1 y(t-1) + \cdots + a_{n_a} y(t-n_a) = b_1 u(t-1) + \cdots \\ &+ b_{n_b} u(t-n_b) + e(t) + c_1 e(t-1) + \cdots + c_{n_c} e(t-n_c) \end{aligned} \tag{4.14}$$

With

$$C(q) = 1 + c_1 q^{-1} + \cdots + c_{n_c} q^{-n_c}$$

it can be rewritten

$$A(q)y(t) = B(q)u(t) + C(q)e(t) \tag{4.15}$$

and clearly corresponds to (4.4) with

$$G(q,\theta) = \frac{B(q)}{A(q)}, \qquad H(q,\theta) = \frac{C(q)}{A(q)} \tag{4.16}$$

where now

$$\theta = [a_1 \;\ldots\; a_{n_a} \;\; b_1 \;\ldots\; b_{n_b} \;\; c_1 \;\ldots\; c_{n_c}]^T \tag{4.17}$$

In view of the moving average (MA) part C(q)e(t), the model (4.15) will
be called ARMAX. The ARMAX model has become a standard tool in control
and econometrics for both system description and control design. A version with
an enforced integration in the noise description is the ARIMA(X) model (I for
integration, with or without the X-variable u), which is useful to describe systems
with slow disturbances; see Box and Jenkins (1970). It is obtained by replacing y(t)
and u(t) in (4.15) by their differences $\Delta y(t) = y(t) - y(t-1)$ and is further discussed
in Section 14.1.

Pseudolinear Regressions

The predictor for (4.15) is obtained by inserting (4.16) into (4.6). This gives

$$\hat{y}(t|\theta) = \frac{B(q)}{C(q)}u(t) + \left[1 - \frac{A(q)}{C(q)}\right]y(t)$$

or

$$C(q)\hat{y}(t|\theta) = B(q)u(t) + [C(q) - A(q)]\,y(t) \tag{4.18}$$

This means that the prediction is obtained by filtering u and y through a filter with
denominator dynamics determined by C(q). To start it up at time t = 0 requires
knowledge of

$$\begin{aligned} &\hat{y}(0|\theta), \ldots, \hat{y}(-n_c+1|\theta) \\ &y(0), \ldots, y(-n^*+1), \qquad n^* = \max(n_c, n_a) \\ &u(0), \ldots, u(-n_b+1) \end{aligned}$$

If these are not available, they can be taken as zero, in which case the prediction
differs from the true one with an error that decays as $c \cdot \mu^t$, where $\mu$ is the maximum
magnitude of the zeros of C(z). It is also possible to start the recursion at time
$\max(n^*, n_b)$ and include the unknown initial conditions $\hat{y}(k|\theta)$, $k = 1, \ldots, n_c$, in
the vector $\theta$.


The predictor (4.18) can be rewritten in formal analogy with (4.12) as follows.
Adding $[1 - C(q)]\,\hat{y}(t|\theta)$ to both sides of (4.18) gives

$$\hat{y}(t|\theta) = B(q)u(t) + [1 - A(q)]\,y(t) + [C(q) - 1]\,[y(t) - \hat{y}(t|\theta)] \tag{4.19}$$

Introduce the prediction error

$$\varepsilon(t,\theta) = y(t) - \hat{y}(t|\theta)$$

and the vector

$$\begin{aligned} \varphi(t,\theta) = [&-y(t-1) \;\ldots\; -y(t-n_a) \;\; u(t-1) \;\ldots \\ &u(t-n_b) \;\; \varepsilon(t-1,\theta) \;\ldots\; \varepsilon(t-n_c,\theta)]^T \end{aligned} \tag{4.20}$$

Then (4.19) can be rewritten as

$$\hat{y}(t|\theta) = \varphi^T(t,\theta)\theta \tag{4.21}$$

Notice the similarity with the linear regression (4.12). The equation (4.21) itself is,
however, no linear regression, due to the nonlinear effect of $\theta$ in the vector $\varphi(t,\theta)$.
To stress the kinship to (4.12), we shall call it a pseudolinear regression.
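
The two ways of computing the ARMAX prediction, the filter recursion (4.18) and the pseudolinear regression (4.21), can be compared in a short sketch. The polynomial coefficients below are hypothetical, all initial conditions are assumed to be zero, and the helper `lag` is introduced only for this example.

```python
import random

def lag(x, t, k):
    """Return x(t-k), with samples before time 0 taken as zero."""
    return x[t - k] if t - k >= 0 else 0.0

random.seed(1)
a = [-1.2, 0.5]   # hypothetical A(q) = 1 - 1.2 q^-1 + 0.5 q^-2
b = [1.0, 0.4]    # hypothetical B(q) = q^-1 + 0.4 q^-2
c = [0.6]         # hypothetical C(q) = 1 + 0.6 q^-1
N = 300
u = [random.gauss(0.0, 1.0) for _ in range(N)]
e = [random.gauss(0.0, 0.3) for _ in range(N)]

# Simulate (4.14) from zero initial conditions
y = [0.0] * N
for t in range(N):
    y[t] = (-sum(ai * lag(y, t, i) for i, ai in enumerate(a, 1))
            + sum(bi * lag(u, t, i) for i, bi in enumerate(b, 1))
            + e[t] + sum(ci * lag(e, t, i) for i, ci in enumerate(c, 1)))

# Route 1: filter recursion (4.18), C(q) yhat = B(q) u + [C(q) - A(q)] y
n = max(len(a), len(c))
a_pad = a + [0.0] * (n - len(a))
c_pad = c + [0.0] * (n - len(c))
y_hat_filter = [0.0] * N
for t in range(N):
    y_hat_filter[t] = (
        sum(bi * lag(u, t, i) for i, bi in enumerate(b, 1))
        + sum((c_pad[i - 1] - a_pad[i - 1]) * lag(y, t, i) for i in range(1, n + 1))
        - sum(ci * lag(y_hat_filter, t, i) for i, ci in enumerate(c, 1)))

# Route 2: pseudolinear regression (4.20)-(4.21); phi depends on theta through eps
theta = a + b + c
y_hat_plr = [0.0] * N
eps = [0.0] * N
for t in range(N):
    phi = ([-lag(y, t, i) for i in range(1, len(a) + 1)]
           + [lag(u, t, i) for i in range(1, len(b) + 1)]
           + [lag(eps, t, i) for i in range(1, len(c) + 1)])
    y_hat_plr[t] = sum(th * ph for th, ph in zip(theta, phi))
    eps[t] = y[t] - y_hat_plr[t]

gap_routes = max(abs(p - q) for p, q in zip(y_hat_filter, y_hat_plr))
gap_noise = max(abs(p - q) for p, q in zip(eps, e))
print(f"routes differ by {gap_routes:.2e}; eps differs from e by {gap_noise:.2e}")
```

Both gaps should be at rounding level here because the simulated data and the predictor share the same zero initial conditions. With unknown initial conditions, the text's remark applies: the discrepancy decays at a rate set by the zeros of C(z).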




Other Equation-Error-Type Model Structures 


Instead of modeling the equation error in (4.7) as a moving average, as we did in
(4.14), it can of course be described as an autoregression. This gives a model set

$$A(q)y(t) = B(q)u(t) + \frac{1}{D(q)}e(t) \tag{4.22}$$

with

$$D(q) = 1 + d_1 q^{-1} + \cdots + d_{n_d} q^{-n_d}$$

which, analogously to the previous terminology, could be called ARARX. More
generally, we could use an ARMA description of the equation error, leading to an
“ARARMAX” structure

$$A(q)y(t) = B(q)u(t) + \frac{C(q)}{D(q)}e(t) \tag{4.23}$$

which of course contains (4.7), (4.15), and (4.22) as special cases. This would thus
form the family of equation-error-related model sets, and is depicted in Figure 4.2.
The relationship to (4.4) as well as expressions for the predictions are straightforward.

Figure 4.2 The equation error model family: The model structure (4.23).


Output Error Model Structure 


The equation error model structures all correspond to descriptions where the transfer 
functions G and H have the polynomial A as a common factor in the denominators. 
See Figure 4.2. From a physical point of view it may seem more natural to parametrize 
these transfer functions independently. 

If we suppose that the relation between input and undisturbed output w can
be written as a linear difference equation, and that the disturbances consist of white
measurement noise, then we obtain the following description:

$$\begin{aligned} w(t) &+ f_1 w(t-1) + \cdots + f_{n_f} w(t-n_f) \\ &= b_1 u(t-1) + \cdots + b_{n_b} u(t-n_b) \end{aligned} \tag{4.24a}$$

$$y(t) = w(t) + e(t) \tag{4.24b}$$

With

$$F(q) = 1 + f_1 q^{-1} + \cdots + f_{n_f} q^{-n_f}$$

we can write the model as

$$y(t) = \frac{B(q)}{F(q)}u(t) + e(t) \tag{4.25}$$

The signal flow of this model is shown in Figure 4.3.

Figure 4.3 The output error model structure.

We call (4.25) an output error (OE) model (structure). The parameter vector
to be determined is

$$\theta = [b_1 \;\; b_2 \;\ldots\; b_{n_b} \;\; f_1 \;\; f_2 \;\ldots\; f_{n_f}]^T \tag{4.26}$$

Since w(t) in (4.24) is never observed, it should rightly carry an index $\theta$, since it is
constructed from u using (4.24a). That is,

$$\begin{aligned} w(t,\theta) &+ f_1 w(t-1,\theta) + \cdots + f_{n_f} w(t-n_f,\theta) \\ &= b_1 u(t-1) + \cdots + b_{n_b} u(t-n_b) \end{aligned} \tag{4.27}$$

Comparing with (4.4), we find that $H(q,\theta) = 1$, which gives the natural predictor

$$\hat{y}(t|\theta) = \frac{B(q)}{F(q)}u(t) = w(t,\theta) \tag{4.28}$$

Note that $\hat{y}(t|\theta)$ is constructed from past inputs only. With the aid of the vector

$$\varphi(t,\theta) = [u(t-1) \;\ldots\; u(t-n_b) \;\; -w(t-1,\theta) \;\ldots\; -w(t-n_f,\theta)]^T \tag{4.29}$$

this can be rewritten as

$$\hat{y}(t|\theta) = \varphi^T(t,\theta)\theta \tag{4.30}$$

which is in formal agreement with the ARMAX-model predictor (4.21). Note that
in (4.29) the $w(t-k,\theta)$ are not observed, but, using (4.28), they can be computed:
$w(t-k,\theta) = \hat{y}(t-k|\theta)$, $k = 1, 2, \ldots, n_f$.

Box-Jenkins Model Structure

A natural development of the output error model (4.25) is to further model the
properties of the output error. Describing this as an ARMA model gives

$$y(t) = \frac{B(q)}{F(q)}u(t) + \frac{C(q)}{D(q)}e(t) \tag{4.31}$$

In a sense, this is the most natural finite-dimensional parametrization, starting from
the description (4.4): the transfer functions G and H are independently parametrized
as rational functions. The model set (4.31) was suggested and treated in Box and
Jenkins (1970). This model also gives us the family of output-error-related models. See
Figure 4.4 and compare with Figure 4.2. According to (4.6), the predictor for (4.31)
is

$$\hat{y}(t|\theta) = \frac{D(q)B(q)}{C(q)F(q)}u(t) + \frac{C(q) - D(q)}{C(q)}y(t) \tag{4.32}$$

Figure 4.4 The BJ-model structure (4.31).
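
Returning to the output error case, the point that the predictor (4.28) is built from inputs only can be made concrete with a small sketch of the recursion (4.27), written in the pseudolinear form (4.29)-(4.30). The coefficient values and the names `lag` and `oe_predict` are hypothetical, and zero initial conditions are assumed.

```python
import random

def lag(x, t, k):
    """Return x(t-k), with samples before time 0 taken as zero."""
    return x[t - k] if t - k >= 0 else 0.0

def oe_predict(b, f, u):
    """Compute w(t, theta) from (4.27); this is the OE prediction (4.28).

    Only the input sequence u is used; the measured output never enters.
    """
    theta = b + f                                   # ordering as in (4.26)
    w = [0.0] * len(u)
    for t in range(len(u)):
        phi = ([lag(u, t, k) for k in range(1, len(b) + 1)]
               + [-lag(w, t, k) for k in range(1, len(f) + 1)])   # (4.29)
        w[t] = sum(th * ph for th, ph in zip(theta, phi))        # (4.30)
    return w

random.seed(2)
b = [0.5, 0.2]    # hypothetical B(q) = 0.5 q^-1 + 0.2 q^-2
f = [-0.9]        # hypothetical F(q) = 1 - 0.9 q^-1
N = 200
u = [random.gauss(0.0, 1.0) for _ in range(N)]
e = [random.gauss(0.0, 0.2) for _ in range(N)]

# Data generated according to (4.24): undisturbed output plus white noise
w_true = oe_predict(b, f, u)
y = [w_true[t] + e[t] for t in range(N)]

# Prediction with the true parameters, and the resulting prediction error
y_hat = oe_predict(b, f, u)
eps = [y[t] - y_hat[t] for t in range(N)]
max_gap = max(abs(eps[t] - e[t]) for t in range(N))
print(f"max |eps - e| = {max_gap:.2e}")
```

Since the measured output is never fed back, an OE prediction is in effect a simulation of the model driven by the input, in contrast to the ARX and ARMAX predictors, which use past outputs.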


A General Family of Model Structures 


The structures we have discussed in this section actually may give rise to 32 different 
model sets, depending on which of the five polynomials A, B, C, D, and F are used. 
(We have, however, only explicitly displayed six of these possibilities here.) Several 
of these model sets belong to the most commonly used ones in practice, and we 
have therefore reason to return to them both for explicit algorithms and for analytic 
results. For convenience, we shall therefore use a generalized model structure 


$$A(q)y(t) = \frac{B(q)}{F(q)}u(t) + \frac{C(q)}{D(q)}e(t) \tag{4.33}$$

Sometimes the dynamics from u to y contains a delay of $n_k$ samples, so some
leading coefficients of B are zero; that is,

$$B(q) = b_{n_k}q^{-n_k} + b_{n_k+1}q^{-n_k-1} + \cdots + b_{n_k+n_b-1}q^{-n_k-n_b+1} = q^{-n_k}\bar{B}(q), \qquad b_{n_k} \ne 0$$

It may then be a good idea to explicitly display this delay by

$$A(q)y(t) = q^{-n_k}\frac{\bar{B}(q)}{F(q)}u(t) + \frac{C(q)}{D(q)}e(t) \tag{4.34}$$

For easier notation we shall, however, here mostly use $n_k = 1$ and (4.33). From
expressions for (4.33) we can always derive the corresponding ones for (4.34) by
replacing u(t) by $u(t - n_k + 1)$.

The structure (4.33) is too general for most practical purposes. One or several of
the five polynomials would be fixed to unity in applications. However, by developing
algorithms and results for (4.33), we also cover all the special cases corresponding to
more realistic model sets.

From (4.6) we know that the predictor for (4.33) is

$$\hat{y}(t|\theta) = \frac{D(q)B(q)}{C(q)F(q)}u(t) + \left[1 - \frac{D(q)A(q)}{C(q)}\right]y(t) \tag{4.35}$$

The common special cases of (4.33) are summarized in Table 4.1. 


TABLE 4.1 Some Common Black-box SISO Models as Special Cases of (4.33)

Polynomials Used in (4.33)    Name of Model Structure
B                             FIR (finite impulse response)
AB                            ARX
ABC                           ARMAX
AC                            ARMA
ABD                           ARARX
ABCD                          ARARMAX
BF                            OE (output error)
BFCD                          BJ (Box-Jenkins)


A Pseudolinear Form for (4.35) (*) 


The expression (4.35) can also be written as a recursion:

$$C(q)F(q)\hat{y}(t|\theta) = F(q)\,[C(q) - D(q)A(q)]\,y(t) + D(q)B(q)u(t) \tag{4.36}$$

From (4.36) we find that the prediction error

$$\varepsilon(t,\theta) = y(t) - \hat{y}(t|\theta)$$

can be written

$$\varepsilon(t,\theta) = \frac{D(q)}{C(q)}\left[A(q)y(t) - \frac{B(q)}{F(q)}u(t)\right] \tag{4.37}$$

It is convenient to introduce the auxiliary variables

$$w(t,\theta) = \frac{B(q)}{F(q)}u(t) \tag{4.38a}$$

and

$$v(t,\theta) = A(q)y(t) - w(t,\theta) \tag{4.38b}$$

Then

$$\varepsilon(t,\theta) = y(t) - \hat{y}(t|\theta) = \frac{D(q)}{C(q)}v(t,\theta) \tag{4.39}$$

Let us also introduce the “state vector”

$$\begin{aligned} \varphi(t,\theta) = [&-y(t-1), \ldots, -y(t-n_a),\; u(t-1), \ldots, u(t-n_b), \\ &-w(t-1,\theta), \ldots, -w(t-n_f,\theta),\; \varepsilon(t-1,\theta), \ldots, \varepsilon(t-n_c,\theta), \\ &-v(t-1,\theta), \ldots, -v(t-n_d,\theta)]^T \end{aligned} \tag{4.40}$$

With the parameter vector

$$\theta = [a_1 \ldots a_{n_a} \;\; b_1 \ldots b_{n_b} \;\; f_1 \ldots f_{n_f} \;\; c_1 \ldots c_{n_c} \;\; d_1 \ldots d_{n_d}]^T \tag{4.41}$$

and (4.40) we can give a convenient expression for the prediction. To find this, we
proceed as follows: From (4.38a) and (4.39) we obtain

$$\begin{aligned} w(t,\theta) = {}& b_1 u(t-1) + \cdots + b_{n_b} u(t-n_b) \\ &- f_1 w(t-1,\theta) - \cdots - f_{n_f} w(t-n_f,\theta) \end{aligned} \tag{4.42}$$

and

$$\begin{aligned} \varepsilon(t,\theta) = {}& v(t,\theta) + d_1 v(t-1,\theta) + \cdots + d_{n_d} v(t-n_d,\theta) \\ &- c_1 \varepsilon(t-1,\theta) - \cdots - c_{n_c} \varepsilon(t-n_c,\theta) \end{aligned} \tag{4.43}$$

Now inserting

$$v(t,\theta) = y(t) + a_1 y(t-1) + \cdots + a_{n_a} y(t-n_a) - w(t,\theta)$$

into (4.43) and substituting $w(t,\theta)$ with the expression (4.42), we find that

$$\varepsilon(t,\theta) = y(t) - \theta^T\varphi(t,\theta) \tag{4.44}$$

Hence

$$\hat{y}(t|\theta) = \theta^T\varphi(t,\theta) = \varphi^T(t,\theta)\theta \tag{4.45}$$


The two expressions, (4.36) and (4.45) can both be used for the calculation of 
the prediction. It should be noticed that the expressions simplify considerably in the 
special cases of the general model (4.33) that have been discussed in this section. 
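
The chain (4.38a), (4.38b), (4.39) gives a direct way of computing prediction errors for the general structure (4.33). The sketch below is one possible arrangement under stated assumptions: polynomials are stored as coefficient lists in powers of $q^{-1}$, all initial conditions are zero, and the helper names `filt` and `prediction_errors` as well as the numerical values are hypothetical.

```python
import random

def filt(num, den, x):
    """Filter x through num(q)/den(q).

    Coefficient lists are in increasing powers of q^-1, and den[0] must be 1.
    Samples before time 0 are taken as zero.
    """
    out = []
    for t in range(len(x)):
        acc = sum(num[k] * x[t - k] for k in range(len(num)) if t - k >= 0)
        acc -= sum(den[k] * out[t - k] for k in range(1, len(den)) if t - k >= 0)
        out.append(acc)
    return out

def prediction_errors(A, B, C, D, F, y, u):
    """Prediction errors of the general structure via (4.38a), (4.38b), (4.39)."""
    w = filt(B, F, u)                        # (4.38a): w = (B/F) u
    Ay = filt(A, [1.0], y)                   # A(q) y(t)
    v = [p - q for p, q in zip(Ay, w)]       # (4.38b): v = A y - w
    return filt(D, C, v)                     # (4.39): eps = (D/C) v

# Hypothetical polynomials; B has no constant term (one sample delay)
A = [1.0, -0.5]
B = [0.0, 1.0, 0.5]
C = [1.0, 0.4]
D = [1.0, -0.8]
F = [1.0, -0.3]

random.seed(3)
N = 300
u = [random.gauss(0.0, 1.0) for _ in range(N)]
e = [random.gauss(0.0, 0.2) for _ in range(N)]

# Simulate (4.33): A(q) y = (B/F) u + (C/D) e
rhs = [p + q for p, q in zip(filt(B, F, u), filt(C, D, e))]
y = filt([1.0], A, rhs)

eps = prediction_errors(A, B, C, D, F, y, u)
y_hat = [p - q for p, q in zip(y, eps)]      # prediction = output - prediction error
max_gap = max(abs(p - q) for p, q in zip(eps, e))
print(f"max |eps - e| = {max_gap:.2e}")
```

Setting some of the lists to `[1.0]` recovers the special cases of Table 4.1; for example, `C = D = F = [1.0]` gives the ARX prediction error.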


Other Model Expansions 


The FIR model structure

$$G(q,\theta) = \sum_{k=1}^{n} b_k q^{-k} \tag{4.46}$$

has two important advantages: it is a linear regression (being a special case of ARX)
and it is an output error model (being a special case of OE). This means, as we shall
see later, that the model can be efficiently estimated and that it is robust against noise.
The basic disadvantage is that many parameters may be needed. If the system has a
pole close to the unit circle, the impulse response decays slowly, so n has then to be
large to approximate the system well. This leads to the question whether it would
be possible to retain the linear regression and output error features, while offering
better possibilities to treat slowly decaying impulse responses. Generally speaking,
such models would look like

$$G(q,\theta) = \sum_{k=1}^{n} \theta_k L_k(q,\alpha) \tag{4.47}$$

where $L_k(q,\alpha)$ represents a function expansion in the delay operator, which may
contain a user-chosen parameter $\alpha$. This parameter would be treated as fixed in the
model structure, in order to make (4.47) a linear regression. A simple choice would
be

$$L_k(q,\alpha) = \frac{q^{-k}}{q - \alpha}$$

where $\alpha$ is an estimate of the system pole closest to the unit circle. More sophisticated
choices in terms of orthonormal basis expansions, see, e.g., Van den Hof, Heuberger,
and Bokor (1995), have attracted wide interest. In particular, Laguerre polynomials
have been used in this context (Wahlberg, 1991):

$$L_k(q,\alpha) = \frac{\sqrt{1-\alpha^2}}{q - \alpha}\left(\frac{1 - \alpha q}{q - \alpha}\right)^{k-1} \tag{4.48}$$

where, again, it is natural to let $\alpha$ be an estimate of the dominating pole (time
constant).
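
To get a feeling for the Laguerre expansion, one can compute the impulse responses of the first few basis functions and check numerically that they are orthonormal. The sketch below assumes the normalized discrete-time Laguerre form, a first-order low-pass factor $\sqrt{1-\alpha^2}/(q-\alpha)$ followed by repeated all-pass factors $(1-\alpha q)/(q-\alpha)$; the function name, the value of alpha, and the truncation horizon are hypothetical choices.

```python
import math

def laguerre_impulse_responses(alpha, n, horizon):
    """Impulse responses of L_1, ..., L_n, truncated to `horizon` samples."""
    gain = math.sqrt(1.0 - alpha ** 2)

    # L_1(q) = gain / (q - alpha): x(t) = alpha x(t-1) + gain * delta(t-1)
    first = [0.0] * horizon
    for t in range(1, horizon):
        first[t] = alpha * first[t - 1] + (gain if t == 1 else 0.0)
    responses = [first]

    # Each further function applies the all-pass factor
    # (1 - alpha q)/(q - alpha) = (q^-1 - alpha)/(1 - alpha q^-1):
    # out(t) = alpha out(t-1) + in(t-1) - alpha in(t)
    for _ in range(n - 1):
        prev = responses[-1]
        nxt = [0.0] * horizon
        nxt[0] = -alpha * prev[0]
        for t in range(1, horizon):
            nxt[t] = alpha * nxt[t - 1] + prev[t - 1] - alpha * prev[t]
        responses.append(nxt)
    return responses

alpha, n, horizon = 0.5, 4, 200
L = laguerre_impulse_responses(alpha, n, horizon)

# Gram matrix of inner products; orthonormality means it is close to identity
gram = [[sum(p * q for p, q in zip(L[i], L[j])) for j in range(n)]
        for i in range(n)]
max_dev = max(abs(gram[i][j] - (1.0 if i == j else 0.0))
              for i in range(n) for j in range(n))
print(f"max deviation from identity: {max_dev:.2e}")
```

Each basis function has all its poles at alpha, so a handful of terms can represent a slowly decaying impulse response that would require a long FIR model, while the model stays linear in the coefficients.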


Continuous-time Black-box Models (*)


The linear system description could also be parameterized in terms of the
continuous-time transfer function (2.22):

$$y(t) = G_c(p,\theta)u(t) \tag{4.49}$$

Adjustments to observed, sampled data could then be achieved either by solving the
underlying differential equations or by applying an exact or approximate sampling
procedure (2.24). The model (4.49) could also be fitted in the frequency domain, to
Fourier transformed band-limited input-output data, as described in Section 7.7.

In addition to obvious counterparts of the structures already discussed, two
specific model sets should be mentioned. The first-order system model with a time
delay

$$G_c(s,\theta) = \frac{K}{sT + 1}e^{-s\tau}, \qquad \theta = [K, \tau, T] \tag{4.50}$$

has been much used in process industry applications. Orthonormal function series
expansions

$$G_c(s,\theta) = \sum_{k=0}^{d-1} \alpha_k f_k(s), \qquad \theta = [\alpha_0, \ldots, \alpha_{d-1}]^T \tag{4.51}$$

have been discussed in the early literature, and also, e.g., by Belanger (1985). Like
for discrete-time models, Laguerre polynomials appear to be a good choice:

$$f_k(s) = \sqrt{2a}\,\frac{(s-a)^k}{(s+a)^{k+1}}$$

a being a time-scaling factor. Clearly, the model (4.49) can then be complemented
with a model for the disturbance effects at the sampling instants as in (2.23).

Multivariable Case: Matrix Fraction Descriptions (*) 


Let us now consider the case where the input u(t) is an m-dimensional vector and
the output y(t) is a p-dimensional vector. Most of the ideas that we have described
in this section have straightforward multivariable counterparts. The simplest case is
the generalization of the equation error model set (4.7). We obtain

$$\begin{aligned} y(t) &+ A_1 y(t-1) + \cdots + A_{n_a} y(t-n_a) \\ &= B_1 u(t-1) + \cdots + B_{n_b} u(t-n_b) + e(t) \end{aligned} \tag{4.52}$$

where the $A_i$ are p x p matrices and the $B_i$ are p x m matrices.

Analogous to (4.9), we may introduce the polynomials

$$\begin{aligned} A(q) &= I + A_1 q^{-1} + \cdots + A_{n_a} q^{-n_a} \\ B(q) &= B_1 q^{-1} + \cdots + B_{n_b} q^{-n_b} \end{aligned} \tag{4.53}$$

These are now matrix polynomials in $q^{-1}$, meaning that A(q) is a matrix whose
entries are polynomials in $q^{-1}$. We note that the system is still given by

$$y(t) = G(q,\theta)u(t) + H(q,\theta)e(t) \tag{4.54}$$

with

$$G(q,\theta) = A^{-1}(q)B(q), \qquad H(q,\theta) = A^{-1}(q) \tag{4.55}$$

The inverse $A^{-1}(q)$ of the matrix polynomial is interpreted and calculated in a
straightforward way as discussed in connection with (3.34). Clearly, $G(q,\theta)$ will be
a p x m matrix whose entries are rational functions of $q^{-1}$ (or q). The factorization
in terms of two matrix polynomials is also called a (left) matrix fraction description
(MFD). A thorough treatment of such descriptions is given in Chapter 6 of Kailath
(1980).

We have not yet discussed the parametrization of (4.52) (i.e., which elements of
the matrices should be included in the parameter vector $\theta$). This is a fairly subtle issue,
which will be further discussed in Appendix 4A. An immediate analog of (4.8) could,
however, be noted: Suppose all matrix entries in (4.52) (a total of $n_a \cdot p^2 + n_b \cdot p \cdot m$)
are included in $\theta$. We may then define the $[n_a \cdot p + n_b \cdot m] \times p$ matrix

$$\Theta = [A_1 \;\; A_2 \;\ldots\; A_{n_a} \;\; B_1 \;\ldots\; B_{n_b}]^T \tag{4.56}$$

and the $[n_a \cdot p + n_b \cdot m]$-dimensional column vector

$$\varphi(t) = \begin{bmatrix} -y(t-1) \\ \vdots \\ -y(t-n_a) \\ u(t-1) \\ \vdots \\ u(t-n_b) \end{bmatrix} \tag{4.57}$$

to rewrite (4.52) as

$$y(t) = \Theta^T\varphi(t) + e(t) \tag{4.58}$$

in obvious analogy with the linear regression (4.12). This can be seen as p different
linear regressions, written on top of each other, all with the same regression vector.

When additional structure is imposed on the parametrization, it is normally no
longer possible to use (4.58), since the different output components will not employ
identical regression vectors. Then a d-dimensional column vector $\theta$ and a p x d
matrix $\varphi^T(t)$ have to be formed so as to represent (4.52) as

$$y(t) = \varphi^T(t)\theta + e(t) \tag{4.59}$$

See Problems 4G.6 and 4E.12 for some more aspects on (4.58) and (4.59).

In light of the different possibilities for SISO systems, it is easy to visualize a
number of variants for the MIMO case, like the vector difference equation (VDE)

$$\begin{aligned} y(t) &+ A_1 y(t-1) + \cdots + A_{n_a} y(t-n_a) \\ &= B_1 u(t-1) + \cdots + B_{n_b} u(t-n_b) \\ &\quad + e(t) + C_1 e(t-1) + \cdots + C_{n_c} e(t-n_c) \end{aligned} \tag{4.60a}$$

or

$$G(q,\theta) = A^{-1}(q)B(q), \qquad H(q,\theta) = A^{-1}(q)C(q) \tag{4.60b}$$

which is the natural extension of the ARMAX model. A multivariable Box-Jenkins
model takes the form

$$G(q,\theta) = F^{-1}(q)B(q), \qquad H(q,\theta) = D^{-1}(q)C(q) \tag{4.61}$$

and so on. The parametrizations of these MFD-descriptions are discussed in
Appendix 4A.
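
For the fully parametrized multivariable equation error model, the bookkeeping behind (4.56) to (4.58) can be illustrated with a small sketch. The dimensions (p = 2 outputs, m = 1 input, first order), the matrix entries, and the helper names are hypothetical; matrices are stored as plain lists of rows.

```python
def transpose(M):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def theta_T_phi(Theta, phi):
    """Compute Theta^T phi, with Theta stored as one row per regressor entry."""
    p = len(Theta[0])
    return [sum(Theta[r][j] * phi[r] for r in range(len(phi))) for j in range(p)]

# Hypothetical first-order example with p = 2 and m = 1
A1 = [[-0.5, 0.1],
      [0.2, -0.3]]          # p x p
B1 = [[1.0],
      [0.5]]                # p x m

# (4.56): Theta = [A1 B1]^T, an (na*p + nb*m) x p = 3 x 2 matrix
Theta = transpose(A1) + transpose(B1)

y_prev = [1.0, -2.0]        # y(t-1)
u_prev = [0.5]              # u(t-1)

# (4.57): regression vector, shared by all output components
phi = [-val for val in y_prev] + u_prev

# (4.58) without the noise term: the predicted output vector
y_pred = theta_T_phi(Theta, phi)

# Direct evaluation of -A1 y(t-1) + B1 u(t-1) for comparison
direct = [-sum(A1[j][r] * y_prev[r] for r in range(2)) + B1[j][0] * u_prev[0]
          for j in range(2)]
print(y_pred, direct)
```

Each column of Theta holds the parameters of one output component, which is why the text describes (4.58) as p linear regressions stacked on top of each other with a common regression vector.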


4.3 STATE-SPACE MODELS 


In the state-space form the relationship between the input, noise, and output signals
is written as a system of first-order differential or difference equations using an
auxiliary state vector x(t). This description of linear dynamical systems became
an increasingly dominating approach after Kalman’s (1960) work on prediction and
linear quadratic control. For our purposes it is especially useful in that insights into
physical mechanisms of the system can usually more easily be incorporated into
state-space models than into the models described in Section 4.2.


Continuous-time Models Based on Physical Insight 


For most physical systems it is easier to construct models with physical insight in
continuous time than in discrete time, simply because most laws of physics (Newton's
law of motion, relationships in electrical circuits, etc.) are expressed in continuous
time. This means that modeling normally leads to a representation

$$\dot{x}(t) = F(\theta)x(t) + G(\theta)u(t) \tag{4.62}$$

Here $F$ and $G$ are matrices of appropriate dimensions ($n \times n$ and $n \times m$, respectively,
for an $n$-dimensional state and an $m$-dimensional input). The overdot denotes
differentiation with respect to (w.r.t.) time $t$. Moreover, $\theta$ is a vector of parameters that
typically correspond to unknown values of physical coefficients, material constants,
and the like. The modeling is usually carried out in terms of state variables $x$ that
have physical significance (positions, velocities, etc.), and then the measured outputs
will be known combinations of the states. Let $\eta(t)$ be the measurements that would
be obtained with ideal, noise-free sensors:

$$\eta(t) = Hx(t) \tag{4.63}$$
Using $p$ for the differentiation operator, (4.62) can be written

$$[pI - F(\theta)]x(t) = G(\theta)u(t)$$

which means that the transfer operator from $u$ to $\eta$ in (4.63) is

$$\begin{aligned} \eta(t) &= G_c(p,\theta)u(t) \\ G_c(p,\theta) &= H[pI - F(\theta)]^{-1}G(\theta) \end{aligned} \tag{4.64}$$


We have thus obtained a continuous-time transfer-function model of the system, as 
in (2.22), that is parametrized in terms of physical coefficients. 

In reality, of course, some noise-corrupted version of $\eta(t)$ is obtained, resulting
from both measurement imperfections and disturbances acting on (4.62). There are
several different possibilities to describe these noise and disturbance effects. Here
we first take the simplest approach. Other cases are discussed in (4.84) and (4.96) to
(4.99), in Problem 4G.7, and in Section 13.7. Let the measurements be sampled at
time instants $t = kT$, $k = 1, 2, \ldots$, and the disturbance effects at those time instants
be $v_T(kT)$. Hence the measured output is

$$y(kT) = Hx(kT) + v_T(kT) = G_c(p,\theta)u(t) + v_T(kT) \tag{4.65}$$


Sampling the Transfer Function 


As we discussed in Section 2.1, there are several ways of transporting $G_c(p,\theta)$ to
a representation that is explicitly discrete time. Suppose that the input is constant
over the sampling interval $T$ as in (2.3):

$$u(t) = u_k = u(kT), \qquad kT \le t < (k+1)T \tag{4.66}$$

Then the differential equation (4.62) can easily be solved from $t = kT$ to $t = kT + T$,
yielding

$$x(kT + T) = A_T(\theta)x(kT) + B_T(\theta)u(kT) \tag{4.67}$$

where

$$A_T(\theta) = e^{F(\theta)T} \tag{4.68a}$$

$$B_T(\theta) = \int_{\tau=0}^{T} e^{F(\theta)\tau}G(\theta)\,d\tau \tag{4.68b}$$

(See, e.g., Åström and Wittenmark, 1984.)
Introducing $q$ for the forward shift of $T$ time units, we can rewrite (4.67) as

$$[qI - A_T(\theta)]x(kT) = B_T(\theta)u(kT) \tag{4.69}$$

or

$$\eta(kT) = G_T(q,\theta)u(kT) \tag{4.70}$$

$$G_T(q,\theta) = H[qI - A_T(\theta)]^{-1}B_T(\theta) \tag{4.71}$$
Hence (4.65) can equivalently be given in the sampled-data form

$$y(t) = G_T(q,\theta)u(t) + v_T(t), \qquad t = T, 2T, 3T, \ldots \tag{4.72}$$

When (4.66) holds, no approximation is involved in this representation. Note,
however, that in view of (4.68) $G_T(q,\theta)$ could be quite a complicated function of $\theta$.
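As a rough numerical illustration of (4.68), the following sketch computes $A_T$ and $B_T$ from truncated power series of the matrix exponential. It is only a minimal sketch: the function name `zoh_discretize`, the helper names, the number of series terms, and the numerical values of `F`, `G`, and `T` are assumptions made for illustration, and a truncated series is adequate only when $\|FT\|$ is small. Production code would normally rely on a library matrix exponential.

```python
import math

def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mat_add(a, b):
    """Elementwise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mat_scale(a, s):
    """Matrix a multiplied by the scalar s."""
    return [[s * x for x in row] for row in a]

def zoh_discretize(F, G, T, terms=30):
    """Approximate (A_T, B_T) of (4.68) by truncated power series.

    A_T = sum_k (F T)^k / k!
    B_T = (sum_k F^k T^(k+1) / (k+1)!) G
    """
    n = len(F)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # F^0
    A_T = [[0.0] * n for _ in range(n)]
    S = [[0.0] * n for _ in range(n)]  # integral of exp(F tau) over [0, T]
    for k in range(terms):
        A_T = mat_add(A_T, mat_scale(power, T ** k / math.factorial(k)))
        S = mat_add(S, mat_scale(power, T ** (k + 1) / math.factorial(k + 1)))
        power = mat_mul(power, F)  # now holds F^(k+1)
    return A_T, mat_mul(S, G)

# Illustrative second-order system (values are assumptions, not from the text).
tau, beta, T = 0.5, 2.0, 0.1
F = [[0.0, 1.0], [0.0, -1.0 / tau]]
G = [[0.0], [beta / tau]]
A_T, B_T = zoh_discretize(F, G, T)
```

Even for this small system the entries of `A_T` and `B_T` are exponential functions of the continuous-time parameters, which is the complication the text refers to.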


Example 4.1 DC Servomotor 


In this example we shall study a physical process, where we have some insight into
the dynamic properties. Consider the dc motor depicted in Figure 4.5 with a block
diagram in Figure 4.6. The input to this system is assumed to be the applied voltage,
$u$, and the output the angle of the motor shaft, $\eta$. The relationship between applied
voltage $u$ and the resulting current $i$ in the rotor circuit is given by the well-known
relationship

$$u(t) = R_a i(t) + L_a \frac{di(t)}{dt} + s(t) \tag{4.73}$$

where $s(t)$ is the back electromotive force, due to the rotation of the armature circuit
in the magnetic field:

$$s(t) = k_v \frac{d}{dt}\eta(t)$$

The current $i$ gives a turning torque of

$$T_a(t) = k_a \cdot i(t)$$

on the motor shaft, which is also affected by a torque $T_\ell(t)$ from the load. Newton's
law then gives

$$J \frac{d^2}{dt^2}\eta(t) = T_a(t) - T_\ell(t) - f \frac{d}{dt}\eta(t) \tag{4.74}$$

Figure 4.5 The dc motor. 
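Figure 4.6 Block diagram of the dc motor.


where $J$ is the moment of inertia of the rotor plus load and $f$ represents viscous
friction. Assuming that the inductance of the armature circuit can be neglected,
$L_a \approx 0$, the preceding equations can be summarized in state-space form as

$$\begin{aligned} \frac{d}{dt}x(t) &= \begin{bmatrix} 0 & 1 \\ 0 & -1/\tau \end{bmatrix} x(t) + \begin{bmatrix} 0 \\ \beta/\tau \end{bmatrix} u(t) + \begin{bmatrix} 0 \\ \gamma'/\tau \end{bmatrix} T_\ell(t) \\ \eta(t) &= \begin{bmatrix} 1 & 0 \end{bmatrix} x(t) \end{aligned} \tag{4.75}$$

with

$$x(t) = \begin{bmatrix} \eta(t) \\ \frac{d}{dt}\eta(t) \end{bmatrix}, \qquad \tau = \frac{J R_a}{f R_a + k_a k_v}, \qquad \beta = \frac{k_a}{f R_a + k_a k_v}, \qquad \gamma' = -\frac{R_a}{f R_a + k_a k_v}$$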

Assume now that the torque $T_\ell$ is identically zero. To determine the dynamics of
the motor, we now apply a piecewise constant input and sample the output with the
sampling interval $T$. The state equation (4.75) can then be described by

$$x(t + T) = A_T(\theta)x(t) + B_T(\theta)u(t) \tag{4.76}$$

where

$$A_T(\theta) = \begin{bmatrix} 1 & \tau(1 - e^{-T/\tau}) \\ 0 & e^{-T/\tau} \end{bmatrix}, \qquad B_T(\theta) = \begin{bmatrix} \beta(\tau e^{-T/\tau} - \tau + T) \\ \beta(1 - e^{-T/\tau}) \end{bmatrix} \tag{4.77}$$

Also assume that $y(t)$, the actual measurement of the angle $\eta(t)$, is made with a
certain error $v(t)$:

$$y(t) = \eta(t) + v(t) \tag{4.78}$$
This error is mainly caused by limited accuracy (e.g., due to the winding of a
potentiometer) and can be described as a sequence of independent random variables
with zero mean and known variance $R_2$ (computed from the truncation error in the
measurement), provided the measurements are not too frequent. We thus have a
model

$$y(t) = G_T(q,\theta)u(t) + v(t)$$

with $v(t)$ being white noise. The natural predictor is thus

$$\hat{y}(t|\theta) = G_T(q,\theta)u(t) = \begin{bmatrix} 1 & 0 \end{bmatrix} [qI - A_T(\theta)]^{-1} B_T(\theta)u(t) \tag{4.79}$$

This predictor is parametrized using only two parameters, $\beta$ and $\tau$. Notice that if
we used our physical insight to conclude only that the system is of second order,
we would use, say, a second-order ARX or OE model containing four adjustable
parameters. As we shall see, using fewer parameters has some positive effects on the
estimation procedure: the variance of the parameter estimates will decrease. The
price is, however, not insignificant. The predictor (4.79) is a far more complicated
function of its two parameters than the corresponding ARX or OE model of its four
parameters. □
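To make the predictor (4.79) concrete, the sketch below builds the closed-form matrices of (4.77) and runs the state recursion from a zero initial state. It is a minimal illustration under stated assumptions: the function names `dc_motor_matrices` and `predict_angle`, the zero initial state, and the numerical values of `tau`, `beta`, and `T` are hypothetical choices, not taken from the text.

```python
import math

def dc_motor_matrices(tau, beta, T):
    """Closed-form A_T and B_T of (4.77) for the dc motor."""
    e = math.exp(-T / tau)
    A_T = [[1.0, tau * (1.0 - e)],
           [0.0, e]]
    B_T = [beta * (tau * e - tau + T),
           beta * (1.0 - e)]
    return A_T, B_T

def predict_angle(tau, beta, T, u):
    """Simulate the predictor (4.79) from a zero initial state.

    u[k] is the input held constant over [kT, (k+1)T); the returned list
    holds the predicted angle at t = 0, T, 2T, ...
    """
    A_T, B_T = dc_motor_matrices(tau, beta, T)
    x = [0.0, 0.0]  # state: angle and angular velocity
    y_hat = []
    for u_k in u:
        y_hat.append(x[0])  # the output row [1 0] picks out the angle
        x = [A_T[0][0] * x[0] + A_T[0][1] * x[1] + B_T[0] * u_k,
             A_T[1][0] * x[0] + A_T[1][1] * x[1] + B_T[1] * u_k]
    return y_hat

# Unit step in the applied voltage (illustrative values).
y_hat = predict_angle(tau=0.5, beta=2.0, T=0.1, u=[1.0] * 50)
```

The prediction depends on $\tau$ through exponentials in both matrices, so a criterion based on it is nonlinear in the parameters, whereas an ARX predictor is linear in its coefficients.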


Equations (4.67) and (4.65) constitute a standard discrete-time state-space
model. For simplicity we henceforth take $T = 1$ and drop the corresponding
index. We also introduce an arbitrary parametrization of the matrix that relates $x$ to
$\eta$: $H = C(\theta)$. We thus have

$$x(t+1) = A(\theta)x(t) + B(\theta)u(t) \tag{4.80a}$$

$$y(t) = C(\theta)x(t) + v(t) \tag{4.80b}$$

corresponding to

$$y(t) = G(q,\theta)u(t) + v(t) \tag{4.81}$$

$$G(q,\theta) = C(\theta)[qI - A(\theta)]^{-1}B(\theta) \tag{4.82}$$

Although sampling a time-continuous description is a natural way to obtain the model
(4.80), it could also for certain applications be posed directly in discrete time, with
the matrices $A$, $B$, and $C$ directly parametrized in terms of $\theta$, rather than indirectly
via (4.68).


Noise Representation and the Time-invariant Kalman Filter 


In the representation (4.80) and (4.81) we could further model the properties of
the noise term $\{v(t)\}$. A straightforward but entirely valid approach would be to
postulate a noise model of the kind

$$v(t) = H(q,\theta)e(t) \tag{4.83}$$

with $\{e(t)\}$ being white noise with variance $\lambda$. The $\theta$-parameters in $H(q,\theta)$ could
be partly in common with those in $G(q,\theta)$ or be additional noise model
parameters.
For state-space descriptions, it is, however, more common to split the lumped
noise term $v(t)$ into contributions from measurement noise $v(t)$ and process noise
$w(t)$ acting on the states, so that (4.80) is written

$$\begin{aligned} x(t+1) &= A(\theta)x(t) + B(\theta)u(t) + w(t) \\ y(t) &= C(\theta)x(t) + v(t) \end{aligned} \tag{4.84}$$

Here $\{w(t)\}$ and $\{v(t)\}$ are assumed to be sequences of independent random
variables with zero mean values and covariances

$$\begin{aligned} Ew(t)w^T(t) &= R_1(\theta) \\ Ev(t)v^T(t) &= R_2(\theta) \\ Ew(t)v^T(t) &= R_{12}(\theta) \end{aligned} \tag{4.85}$$

The disturbances $w(t)$ and $v(t)$ may often be signals whose physical origins are
known. In Example 4.1 the load variation $T_\ell(t)$ was a "process noise," while the
inaccuracy in the potentiometer angular sensor $v(t)$ was the "measurement noise."
In such cases it may of course not always be realistic to assume that these signals
are white noises. To arrive at (4.84) and (4.85) will then require extra modeling and
extension of the state vector. See Problem 4G.2.


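Let us now turn to the problem of predicting $y(t)$ in (4.84). This state-space
description is one to which the celebrated Kalman filter applies (see, e.g., Anderson
and Moore, 1979, for a thorough treatment). The conditional expectation of $y(t)$,
given data $y(s)$, $u(s)$, $s \le t - 1$ (i.e., from the infinite past up to time $t - 1$), is, provided
$v$ and $w$ are Gaussian processes, given by

$$\begin{aligned} \hat{x}(t+1,\theta) &= A(\theta)\hat{x}(t,\theta) + B(\theta)u(t) + K(\theta)\left[y(t) - C(\theta)\hat{x}(t,\theta)\right] \\ \hat{y}(t|\theta) &= C(\theta)\hat{x}(t,\theta) \end{aligned} \tag{4.86}$$

Here $K(\theta)$ is given as

$$K(\theta) = \left[A(\theta)P(\theta)C^T(\theta) + R_{12}(\theta)\right]\left[C(\theta)P(\theta)C^T(\theta) + R_2(\theta)\right]^{-1} \tag{4.87a}$$

where $P(\theta)$ is obtained as the positive semidefinite solution of the stationary Riccati
equation:

$$\begin{aligned} P(\theta) = {} & A(\theta)P(\theta)A^T(\theta) + R_1(\theta) - \left[A(\theta)P(\theta)C^T(\theta) + R_{12}(\theta)\right] \\ & \times \left[C(\theta)P(\theta)C^T(\theta) + R_2(\theta)\right]^{-1}\left[A(\theta)P(\theta)C^T(\theta) + R_{12}(\theta)\right]^T \end{aligned} \tag{4.87b}$$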




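The predictor filter can thus be written as

$$\begin{aligned} \hat{y}(t|\theta) = {} & C(\theta)\left[qI - A(\theta) + K(\theta)C(\theta)\right]^{-1}B(\theta)u(t) \\ & + C(\theta)\left[qI - A(\theta) + K(\theta)C(\theta)\right]^{-1}K(\theta)y(t) \end{aligned} \tag{4.88}$$

The matrix $P(\theta)$ is the covariance matrix of the state estimate error:

$$P(\theta) = E\left[x(t) - \hat{x}(t,\theta)\right]\left[x(t) - \hat{x}(t,\theta)\right]^T \tag{4.89}$$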
Innovations Representation 
The prediction error

$$y(t) - C(\theta)\hat{x}(t,\theta) = C(\theta)\left[x(t) - \hat{x}(t,\theta)\right] + v(t) \tag{4.90}$$

in (4.86) amounts to that part of $y(t)$ that cannot be predicted from past data: "the
innovation." Denoting this quantity by $e(t)$ as in (3.25), we find that (4.86) can be
rewritten as

$$\begin{aligned} \hat{x}(t+1,\theta) &= A(\theta)\hat{x}(t,\theta) + B(\theta)u(t) + K(\theta)e(t) \\ y(t) &= C(\theta)\hat{x}(t,\theta) + e(t) \end{aligned} \tag{4.91a}$$

The covariance of $e(t)$ can be determined from (4.90) and (4.89):

$$Ee(t)e^T(t) = \Lambda(\theta) = C(\theta)P(\theta)C^T(\theta) + R_2(\theta) \tag{4.91b}$$


Since $e(t)$ appears explicitly, this representation is known as the innovations form
of the state-space description. Using the shift operator $q$, we can clearly rearrange
it as

$$y(t) = G(q,\theta)u(t) + H(q,\theta)e(t) \tag{4.92a}$$

$$\begin{aligned} G(q,\theta) &= C(\theta)[qI - A(\theta)]^{-1}B(\theta) \\ H(q,\theta) &= C(\theta)[qI - A(\theta)]^{-1}K(\theta) + I \end{aligned} \tag{4.92b}$$

showing its relationship to the general model (4.4) and to a direct modeling of $v(t)$
as in (4.83). See also Problem 4G.3.


Directly Parametrized Innovations Form 


In (4.91) the Kalman gain $K(\theta)$ is computed from $A(\theta)$, $C(\theta)$, $R_1(\theta)$, $R_{12}(\theta)$, and
$R_2(\theta)$ in the fairly complicated manner given by (4.87). It is an attractive idea to
sidestep (4.87) and the parametrization of the $R$-matrices by directly parametrizing
$K(\theta)$ in terms of $\theta$. This has the important advantage that the predictor (4.88)
becomes a much simpler function of $\theta$. Such a model structure we call a directly
parametrized innovations form.
The $R$-matrices describing the noise properties contain $\frac{1}{2}n(n+1) + np + \frac{1}{2}p(p+1)$
matrix elements (discounting symmetric ones), while the Kalman gain
$K$ contains $np$ elements ($p = \dim y$, $n = \dim x$). If we have no prior knowledge
about the $R$-matrices and thus would need many parameters to describe them, it
would therefore be a better alternative to parametrize $K(\theta)$, also from the point
of view of keeping $\dim \theta$ small. On the other hand, physical insight into (4.84)
may entail knowing, for example, that the process noise affects only one state and is
independent of the measurement noise, which might have a known variance. Then
the parametrization of $K(\theta)$ via (4.85) and (4.87) may be done using fewer parameters
than would be required in a direct parametrization of $K(\theta)$.


Remark. The parametrization in terms of (4.85) also gives a parametrization
of the $p(p+1)/2$ elements of $\Lambda(\theta)$ in (4.91). A direct parametrization of (4.91)
would involve extra parameters for $\Lambda$, which, however, would not affect the
predictor. (Compare also Problems 7E.4 and 8E.2.)


Directly parametrized innovations forms also contain black-box models that 
are in close relationship to those discussed in Section 4.2. 


Example 4.2 Companion Form Parametrizations 


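In (4.91) let

$$\theta^T = \begin{bmatrix} a_1 & a_2 & a_3 & b_1 & b_2 & b_3 & k_1 & k_2 & k_3 \end{bmatrix}$$

and

$$A(\theta) = \begin{bmatrix} -a_1 & 1 & 0 \\ -a_2 & 0 & 1 \\ -a_3 & 0 & 0 \end{bmatrix}, \qquad B(\theta) = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}, \qquad K(\theta) = \begin{bmatrix} k_1 \\ k_2 \\ k_3 \end{bmatrix}$$

$$C(\theta) = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}$$

These matrices are said to be in companion form or in observer canonical form (see,
e.g., Kailath, 1980). It is easy to verify that with these matrices

$$C(\theta)[qI - A(\theta)]^{-1}B(\theta) = \frac{b_1 q^{-1} + b_2 q^{-2} + b_3 q^{-3}}{1 + a_1 q^{-1} + a_2 q^{-2} + a_3 q^{-3}}$$

and

$$C(\theta)[qI - A(\theta)]^{-1}K(\theta) = \frac{k_1 q^{-1} + k_2 q^{-2} + k_3 q^{-3}}{1 + a_1 q^{-1} + a_2 q^{-2} + a_3 q^{-3}}$$

so that

$$1 + C(\theta)[qI - A(\theta)]^{-1}K(\theta) = \frac{1 + c_1 q^{-1} + c_2 q^{-2} + c_3 q^{-3}}{1 + a_1 q^{-1} + a_2 q^{-2} + a_3 q^{-3}}$$

with

$$c_i = a_i + k_i, \qquad i = 1, 2, 3$$

With this we have consequently obtained a parametrization of the ARMAX model
set (4.15) and (4.16) for $n_a = n_b = n_c = 3$. □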
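The identities of Example 4.2 can be checked numerically by comparing impulse-response coefficients: $C A^{j-1} B$ from the state-space side against the coefficients generated by the corresponding difference equation. The sketch below does this for arbitrary illustrative parameter values. The function names and the numbers are assumptions made for this check only.

```python
def companion_matrices(a, b, k):
    """Observer-canonical A, B, K, C of Example 4.2 (third order)."""
    A = [[-a[0], 1.0, 0.0],
         [-a[1], 0.0, 1.0],
         [-a[2], 0.0, 0.0]]
    C = [1.0, 0.0, 0.0]
    return A, list(b), list(k), C

def markov_parameters(A, B, C, n):
    """Impulse-response coefficients C A^(j-1) B for j = 1, ..., n."""
    x = list(B)
    out = []
    for _ in range(n):
        out.append(sum(c_i * x_i for c_i, x_i in zip(C, x)))
        x = [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(x))]
    return out

def rational_impulse(num, den, n):
    """Coefficients g(0), ..., g(n) of num(q^-1)/den(q^-1), with den[0] == 1."""
    g = []
    for t in range(n + 1):
        val = num[t] if t < len(num) else 0.0
        for i in range(1, len(den)):
            if t - i >= 0:
                val -= den[i] * g[t - i]
        g.append(val)
    return g

# Arbitrary illustrative parameter values.
a, b, k = [0.5, -0.2, 0.1], [1.0, 0.5, 0.25], [0.3, -0.1, 0.2]
A, B, K, C = companion_matrices(a, b, k)
n = 12

g_state = markov_parameters(A, B, C, n)
g_poly = rational_impulse([0.0] + b, [1.0] + a, n)[1:]

c = [a_i + k_i for a_i, k_i in zip(a, k)]          # c_i = a_i + k_i
h_state = [1.0] + markov_parameters(A, K, C, n)    # 1 + C (qI - A)^-1 K
h_poly = rational_impulse([1.0] + c, [1.0] + a, n)
```

Agreement of `g_state` with `g_poly` and of `h_state` with `h_poly` over the first few lags is consistent with the transfer-function expressions above; it is a numerical sanity check, not a proof.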


The corresponding parametrization of a multioutput model is more involved 
and is described in Appendix 4A. 


Time-varying Predictors (*)


For the predictor filter (4.86) and (4.87) we assumed all previous data from time
minus infinity to be available. If data prior to time $t = 0$ are lacking, we could
replace them by zero, thus starting the recursion (4.86) at $t = 0$ with $\hat{x}(0) = 0$,
and take the penalty of a suboptimal estimate. This was also our philosophy in
Section 3.2.


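An advantage with the state-space formulation is that a correct treatment of
incomplete information about $t < 0$ can be given at the price of a slightly more
complex predictor. If the information about the history of the system prior to $t = 0$ is
given in terms of an initial state estimate $x_0(\theta) = \hat{x}(0,\theta)$ and associated uncertainty

$$\Pi_0(\theta) = E\left[x(0) - x_0(\theta)\right]\left[x(0) - x_0(\theta)\right]^T \tag{4.93}$$

then the Kalman filter tells us that the one-step-ahead prediction is given by (see,
e.g., Anderson and Moore, 1979)

$$\begin{aligned} \hat{x}(t+1,\theta) &= A(\theta)\hat{x}(t,\theta) + B(\theta)u(t) + K(t,\theta)\left[y(t) - C(\theta)\hat{x}(t,\theta)\right] \\ \hat{y}(t|\theta) &= C(\theta)\hat{x}(t,\theta), \qquad \hat{x}(0,\theta) = x_0(\theta) \end{aligned} \tag{4.94}$$

$$\begin{aligned} K(t,\theta) &= \left[A(\theta)P(t,\theta)C^T(\theta) + R_{12}(\theta)\right]\left[C(\theta)P(t,\theta)C^T(\theta) + R_2(\theta)\right]^{-1} \\ P(t+1,\theta) &= A(\theta)P(t,\theta)A^T(\theta) + R_1(\theta) - K(t,\theta)\left[C(\theta)P(t,\theta)C^T(\theta) + R_2(\theta)\right]K^T(t,\theta), \\ P(0,\theta) &= \Pi_0(\theta) \end{aligned} \tag{4.95}$$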
Now $K(t,\theta)$ determined by (4.95) converges, under general conditions, fairly rapidly
to $K(\theta)$ given by (4.87) (see, e.g., Anderson and Moore, 1979). For many problems
it is thus reasonable to apply the limit form (4.86) with (4.87) directly to simplify
calculations. For short data records, though, the solution (4.93) to (4.95) gives a useful
possibility to deal with the transient properties in a correct way, including possibly
a parametrization of the unknown initial conditions $x_0(\theta)$ and $\Pi_0(\theta)$. Clearly, the
steady-state approach (4.86) with (4.87) is a special case of (4.94) to (4.95),
corresponding to $x_0(\theta) = 0$, $\Pi_0(\theta) = P(\theta)$.
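The convergence of $K(t,\theta)$ to the stationary gain can be illustrated with a scalar system, for which (4.95) reduces to ordinary arithmetic. The sketch below is a minimal illustration only: the function name `riccati_recursion`, the scalar restriction ($n = p = 1$), and all numerical values are assumptions chosen for this example.

```python
def riccati_recursion(a, c, r1, r2, r12, p0, steps):
    """Scalar version of (4.95).

    Returns the gains K(0), ..., K(steps-1) and the final value P(steps).
    """
    p = p0
    gains = []
    for _ in range(steps):
        s = c * p * c + r2            # innovation variance C P C^T + R_2
        k = (a * p * c + r12) / s     # time-varying gain K(t)
        gains.append(k)
        p = a * p * a + r1 - k * s * k
    return gains, p

# Illustrative scalar system: x(t+1) = 0.9 x(t) + w(t), y(t) = x(t) + v(t).
a, c, r1, r2, r12 = 0.9, 1.0, 1.0, 1.0, 0.0

# Two very different initial uncertainties Pi_0.
gains_small, p_small = riccati_recursion(a, c, r1, r2, r12, p0=0.0, steps=50)
gains_large, p_large = riccati_recursion(a, c, r1, r2, r12, p0=100.0, steps=50)
```

For these values the stationary Riccati equation (4.87b) reduces to $P^2 - 0.81P - 1 = 0$, and both runs settle at its positive root after a handful of steps, regardless of $\Pi_0$. Only the first few gains differ, which is why the transient matters mainly for short data records.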


Sampling Continuous-time Process Noise (*)


Just as for the system dynamics, we may have more insight into the nature of the
process noise in continuous time. We could then pose a disturbed state-space model

$$\dot{x}(t) = F(\theta)x(t) + G(\theta)u(t) + w(t) \tag{4.96}$$

where $w(t)$ is formal white noise with covariance function

$$Ew(t)w^T(s) = \bar{R}_1(\theta)\delta(t - s) \tag{4.97}$$

where $\delta$ is Dirac's delta function. When the input is piecewise constant as in (4.66),
the corresponding discrete-time state equation becomes

$$x(kT + T) = A_T(\theta)x(kT) + B_T(\theta)u(kT) + w_T(kT) \tag{4.98}$$

where $A_T$ and $B_T$ are given by (4.68) and $w_T(kT)$, $k = 1, 2, \ldots$, is a sequence of
independent random vectors with zero means and covariance matrix

$$Ew_T(kT)w_T^T(kT) = R_1(\theta) = \int_0^T e^{F(\theta)\tau}\bar{R}_1(\theta)e^{F^T(\theta)\tau}\,d\tau \tag{4.99}$$

See Åström (1970) for a derivation.


State-space Models 


In summary, we have found that state-space models provide us with a spectrum of 
modeling possibilities: We may use physical modeling in continuous time with or 
without a corresponding time-continuous noise description to obtain structures with 
physical parameters $\theta$. We can use physical parametrization of the dynamics part
combined with a black-box parametrization of the noise properties, such as in the 
directly parametrized innovations form (4.91), or we can arrive at a noise model that 
is also physically parametrized via (4.96) to (4.99). Finally, we can use black-box 
state-space structures, such as the one of Example 4.2. These have the advantage 
over the input-output black box that the flexibility in choice of representation can 
secure better numerical properties of the parametrization (Problem 16E.1). 
4.4 DISTRIBUTED PARAMETER MODELS (*)


Models that involve partial differential equations (PDE), directly or indirectly, when
relating the input signal to the output signal are usually called distributed parameter
models. "Distributed" then refers to the state vector, which in general belongs to
a function space, rather than $\mathbf{R}^n$. There are basically two ways to deal with such
models. One is to replace the space variable derivative by a difference expression
or to truncate a function series expansion so as to approximate the PDE by an
ordinary differential equation. Then a "lumped" finite-dimensional model, of the
kind we discussed in Section 4.3, is obtained. ("Lumped" refers to the fact that the
distributed states are lumped together into a finite collection.) The other approach is
to stick to the original PDE for the calculations, and only at the final, numerical, stage
introduce approximations to facilitate the computations. It should be noted that this
second approach also remains within the general model structure (4.4), provided the
underlying PDE is linear and time invariant. This is best illustrated by an example.


Example 4.3 Heating Dynamics 


Consider the physical system schematically depicted in Figure 4.7. It consists of a
well-insulated metal rod, which is heated at one end. The heating power at time $t$ is
the input $u(t)$, while the temperature measured at the other end is the output $y(t)$.
This output is sampled at $t = 1, 2, \ldots$.

Figure 4.7 The heat-rod system.

Under ideal conditions, this system is described by the heat-diffusion equation.
If $x(t,\xi)$ denotes the temperature at time $t$, $\xi$ length units from one end of the rod,
then

$$\frac{\partial x(t,\xi)}{\partial t} = \kappa \frac{\partial^2 x(t,\xi)}{\partial \xi^2} \tag{4.100}$$

where $\kappa$ is the coefficient of thermal conductivity. The heating at the far end means
that

$$\left.\frac{\partial x(t,\xi)}{\partial \xi}\right|_{\xi = L} = K \cdot u(t) \tag{4.101}$$

where $K$ is a heat-transfer coefficient. The near end is insulated so that

$$\left.\frac{\partial x(t,\xi)}{\partial \xi}\right|_{\xi = 0} = 0 \tag{4.102}$$
The measurements are

$$y(t) = x(t,0) + v(t), \qquad t = 1, 2, \ldots \tag{4.103}$$

where $\{v(t)\}$ accounts for the measurement noise. The unknown parameters are

$$\theta = \begin{bmatrix} K \\ \kappa \end{bmatrix} \tag{4.104}$$

Approximating

$$\frac{\partial^2 x(t,\xi)}{\partial \xi^2} \approx \frac{x(t,\xi + \Delta L) - 2x(t,\xi) + x(t,\xi - \Delta L)}{(\Delta L)^2}, \qquad \xi = k \cdot \Delta L$$

transfers (4.100) to a state-space model of order $n = L/\Delta L$, where the state variables
$x(t, k \cdot \Delta L)$ are lumped representatives for $x(t,\xi)$, $k \cdot \Delta L \le \xi < (k+1) \cdot \Delta L$.
This often gives a reasonable approximation of the heat-diffusion equation.

Here we instead retain the PDE (4.100) by Laplace transforming it. Thus
let $X(s,\xi)$ be the Laplace transform of $x(t,\xi)$ with respect to $t$ for fixed $\xi$. Then
(4.100) to (4.102) take the form

$$\begin{aligned} sX(s,\xi) &= \kappa X''(s,\xi) \\ X'(s,L) &= K \cdot U(s) \\ X'(s,0) &= 0 \end{aligned} \tag{4.105}$$

Prime and double prime here denote differentiation with respect to $\xi$, and $U(s)$ is
the Laplace transform of $u(t)$. Solving (4.105) for fixed $s$ gives

$$X(s,\xi) = A(s)e^{-\xi\sqrt{s/\kappa}} + B(s)e^{\xi\sqrt{s/\kappa}}$$

where the constants $A(s)$ and $B(s)$ are determined from the boundary values

$$\begin{aligned} X'(s,0) &= 0 \\ X'(s,L) &= K \cdot U(s) \end{aligned}$$

which gives

$$A(s) = B(s) = \frac{K \cdot U(s)}{\sqrt{s/\kappa}\left(e^{L\sqrt{s/\kappa}} - e^{-L\sqrt{s/\kappa}}\right)} \tag{4.106}$$

Inserting this into (4.103) gives

$$Y(s) = X(s,0) + V(s) = G_c(s,\theta)U(s) + V(s) \tag{4.107}$$

$$G_c(s,\theta) = \frac{2K}{\sqrt{s/\kappa}\left(e^{L\sqrt{s/\kappa}} - e^{-L\sqrt{s/\kappa}}\right)} \tag{4.108}$$

where $V(s)$ is the Laplace transform of the noise $\{v(t)\}$. We have thus arrived at
a model parametrization of the kind (4.49). With some sampling procedure and a
model for the measurement noise sequence, it can be carried further to the form
(4.4). Note that $G_c(s,\theta)$ is an analytic function of $s$ although not rational. All our
concepts of poles, zeros, stability, and so on, can still be applied. □
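Although (4.108) is not rational, it is straightforward to evaluate numerically. The sketch below evaluates $G_c(s,\theta)$ at a few points $s = i\omega$ using the standard-library `cmath` module. The function name `heat_rod_transfer`, the rod length, and the parameter values are illustrative assumptions.

```python
import cmath

def heat_rod_transfer(s, kappa, K, L):
    """Evaluate G_c(s, theta) of (4.108) for a complex s != 0."""
    r = cmath.sqrt(s / kappa)
    return 2.0 * K / (r * (cmath.exp(L * r) - cmath.exp(-L * r)))

# Illustrative values: conductivity kappa, heat-transfer coefficient K, rod length L.
kappa, K, L = 1.0, 2.0, 1.0

# Frequency response at a few angular frequencies.
frequencies = (0.01, 0.1, 1.0, 10.0)
response = [heat_rod_transfer(1j * w, kappa, K, L) for w in frequencies]
gains = [abs(g) for g in response]
```

For small $|s|$ the expression behaves like $K\kappa/(Ls)$, an integrator, which agrees with the physical picture of an insulated rod accumulating the supplied heat. The gain then falls off with increasing frequency.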
We can thus include distributed parameter models in our treatment of system
identification methods. There is a substantial literature on this subject. See, for
example, Banks, Crowley, and Kunisch (1983) and Kubrusly (1977). Not surprisingly,
computational issues, choice of basis functions, and the like, play an important role
in this literature.


4.5 MODEL SETS, MODEL STRUCTURES, AND IDENTIFIABILITY: SOME
FORMAL ASPECTS (*)


In this chapter we have dealt with models of linear systems, as well as with
parametrized sets of such models. When it comes to analysis of identification methods,
it turns out that certain properties will have to be required from these models and 
model sets. In this section we shall discuss such formal aspects. To keep notation 
simple, we treat explicitly only SISO models. 


Some Notation 


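For the expressions we shall deal with in this section, it is convenient to introduce
some more compact notation. With

$$T(q) = \begin{bmatrix} G(q) & H(q) \end{bmatrix} \quad \text{and} \quad \chi(t) = \begin{bmatrix} u(t) \\ e(t) \end{bmatrix} \tag{4.109}$$

we can rewrite (4.1) as

$$y(t) = T(q)\chi(t) \tag{4.110}$$

The model structure (4.4) can similarly be written

$$y(t) = T(q,\theta)\chi(t), \qquad T(q,\theta) = \begin{bmatrix} G(q,\theta) & H(q,\theta) \end{bmatrix} \tag{4.111}$$

Given the model (4.110), we can determine the one-step-ahead predictor (3.46),
which we can rewrite as

$$\hat{y}(t|t-1) = W(q)z(t) \tag{4.112}$$

with

$$W(q) = \begin{bmatrix} W_u(q) & W_y(q) \end{bmatrix}, \qquad z(t) = \begin{bmatrix} u(t) \\ y(t) \end{bmatrix} \tag{4.113}$$

$$W_u(q) = H^{-1}(q)G(q), \qquad W_y(q) = \left[1 - H^{-1}(q)\right] \tag{4.114}$$

Clearly, (4.114) defines a one-to-one relationship between $T(q)$ and $W(q)$:

$$T(q) \leftrightarrow W(q) \tag{4.115}$$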
Remark. Based on (4.110), we may prefer to work with the $k$-step-ahead
predictor (3.31). To keep the link (4.115), we can view (3.31) as the one-step-ahead
predictor for the model (3.32).


Models 


We noted already in (4.1) that a model of a linear system consists of specified transfer
functions $G(z)$ and $H(z)$, possibly complemented with a specification of the
prediction error variance $\lambda$, or the PDF $f_e(x)$ of the prediction error $e$. In Sections 3.2 and
3.3, we made the point that what matters in the end is by which expression future
outputs are predicted. The one-step-ahead predictor based on the model (4.1) is
given by (4.112).

While the predictor (4.112) via (4.115) is in a one-to-one relationship with
(4.110), it is useful to relax the link (4.115) and regard (4.112) as the basic model.
This will, among other things, allow a direct extension to nonlinear and time-varying
models, as shown in Section 5.7. We may thus formally define what we mean by a
model:


Definition 4.1. A predictor model of a linear, time-invariant system is a stable
filter $W(q)$, defining a predictor (4.112) as in (4.113).


Stability, which was defined in (2.27) (applying to both components of $W(q)$), is
necessary to make the right side of (4.112) well defined. While predictor models are
meaningful also in a deterministic framework without a stochastic alibi, as discussed
in Section 3.3, it is useful also to consider models that specify properties of the
associated prediction errors (innovations).


Definition 4.2. A complete probabilistic model of a linear, time-invariant system is
a pair $(W(q), f_e(x))$ of a predictor model $W(q)$ and the PDF $f_e(x)$ of the associated
prediction errors.


Clearly, we can also have models where the PDFs are only partially specified (e.g.,
by the variance of $e$).

In this section we shall henceforth only deal with predictor models and therefore 
drop this adjective. The concepts for probabilistic models are quite analogous. 

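We shall say that two models $W_1(q)$ and $W_2(q)$ are equal if

$$W_1(e^{i\omega}) = W_2(e^{i\omega}), \qquad \text{almost all } \omega \tag{4.116}$$

A model

$$W(q) = \begin{bmatrix} W_u(q) & W_y(q) \end{bmatrix}$$

will be called a $k$-step-ahead predictor model if

$$W_y(q) = \sum_{\ell=k}^{\infty} w_y(\ell)q^{-\ell}, \qquad \text{with } w_y(k) \neq 0 \tag{4.117}$$

and an output error model (or a simulation model) if $W_y(q) = 0$.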
Note that the definition requires the predictors to be stable. This does not
necessarily mean that the system dynamics is stable.


Example 4.4 Unstable System 


Suppose that

$$G(q) = \frac{bq^{-1}}{1 + aq^{-1}}, \qquad \text{with } |a| > 1$$

and

$$H(q) = \frac{1}{1 + aq^{-1}}$$

This means that the model is described by

$$y(t) + ay(t-1) = bu(t-1) + e(t)$$

and the dynamics from $u$ to $y$ is unstable. The predictor functions are, however,

$$W_y(q) = -aq^{-1}, \qquad W_u(q) = bq^{-1}$$

implying that

$$\hat{y}(t|t-1) = -ay(t-1) + bu(t-1)$$

which clearly satisfies the condition of Definition 4.1. □
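A short simulation makes the point of Example 4.4 tangible: the output of the system diverges, yet the one-step-ahead prediction error stays equal to $e(t)$. The sketch below uses a deterministic stand-in for the noise sequence so that the run is reproducible. The function name and all numerical values are assumptions for illustration.

```python
def simulate_with_predictor(a, b, u, e):
    """Simulate y(t) + a y(t-1) = b u(t-1) + e(t) and its one-step predictor.

    Returns the output sequence and the predictions -a y(t-1) + b u(t-1),
    both started from zero initial conditions.
    """
    y, y_hat = [], []
    y_prev, u_prev = 0.0, 0.0
    for u_t, e_t in zip(u, e):
        pred = -a * y_prev + b * u_prev   # W_y(q) y(t) + W_u(q) u(t)
        y_t = pred + e_t                  # the system equation itself
        y_hat.append(pred)
        y.append(y_t)
        y_prev, u_prev = y_t, u_t
    return y, y_hat

n = 30
u = [1.0] * n                              # constant input
e = [0.1 * (-1) ** t for t in range(n)]    # deterministic stand-in for noise
y, y_hat = simulate_with_predictor(a=1.5, b=1.0, u=u, e=e)
errors = [y_t - p_t for y_t, p_t in zip(y, y_hat)]
```

Here `y` grows roughly like $1.5^t$, while `errors` reproduces `e` up to rounding. The predictor is a finite impulse response filter acting on measured data, so its stability does not depend on $a$.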


Model Sets 


Definition 4.1 describes one given model of a linear system. The identification
problem is to determine such a model. The search for a suitable model will typically be
conducted over a set of candidate models. Quite naturally, we define a model set $\mathcal{M}^*$
as

$$\mathcal{M}^* = \{W_\alpha(q) \mid \alpha \in \mathcal{A}\} \tag{4.118}$$

This is just a collection of models, each subject to Definition 4.1, here "enumerated"
with an index $\alpha$ covering an index set $\mathcal{A}$.
Typical model sets could be

$$\mathcal{M}^* = \mathcal{L}^* = \{\text{all linear models}\}$$

that is, all models that are subject to Definition 4.1, or

$$\mathcal{M}_n^* = \{\text{all models such that } W_y(q) \text{ and } W_u(q) \text{ are polynomials of } q^{-1} \text{ of degree at most } n\} \tag{4.119}$$

or a finite model set

$$\mathcal{M}^* = \{W_1(q), W_2(q), W_3(q)\} \tag{4.120}$$

We say that two model sets are equal, $\mathcal{M}_1^* = \mathcal{M}_2^*$, if for any $W_1$ in $\mathcal{M}_1^*$ there exists
a $W_2$ in $\mathcal{M}_2^*$ such that $W_1 = W_2$ [defined by (4.116)], and vice versa.
Model Structures: Parametrization of Model Sets


Most often a model set of interest is noncountable. Since we have to conduct a search
over it for "the best model," it is then interesting how the indexation is chosen. The
basic idea is to parametrize (index) the set "smoothly" over a "nice" area and perform
the search over the parameter set (the index set). To put this formally, we let the
model be indexed by a $d$-dimensional vector $\theta$:

$$W(q,\theta)$$

To formalize "smoothly," we require that for any given $z$, $|z| \ge 1$, the complex-valued
function $W(z,\theta)$ of $\theta$ be differentiable:

$$\Psi(z,\theta) = \frac{d}{d\theta}W(z,\theta) \tag{4.121a}$$

Here

$$\Psi(z,\theta) = \begin{bmatrix} \Psi_u(z,\theta) & \Psi_y(z,\theta) \end{bmatrix} = \begin{bmatrix} \dfrac{d}{d\theta}W_u(z,\theta) & \dfrac{d}{d\theta}W_y(z,\theta) \end{bmatrix} \tag{4.121b}$$

is a $d \times 2$ matrix. Thus the gradient of the prediction $\hat{y}(t|\theta)$ is given by

$$\psi(t,\theta) = \frac{d}{d\theta}\hat{y}(t|\theta) = \Psi(q,\theta)z(t) \tag{4.121c}$$

Since the filters $\Psi$ will have to be computed and used when the search is carried out,
we also require them to be stable. We thus have the following definition:


Definition 4.3. A model structure $\mathcal{M}$ is a differentiable mapping from a connected,
open subset $D_{\mathcal{M}}$ of $\mathbf{R}^d$ to a model set $\mathcal{M}^*$, such that the gradients of the predictor
functions are stable.


To put this definition in mathematical notation we have

$$\mathcal{M}: D_{\mathcal{M}} \ni \theta \to \mathcal{M}(\theta) = W(q,\theta) \in \mathcal{M}^* \tag{4.122}$$

such that the filter $\Psi$ in (4.121) exists and is stable for $\theta \in D_{\mathcal{M}}$. We will thus use
$\mathcal{M}(\theta)$ to denote the particular model corresponding to $\theta$ and reserve $\mathcal{M}$ for the
mapping itself.


Remark. The requirement that $D_{\mathcal{M}}$ should be open is in order for the
derivatives in (4.121) to be unambiguously well defined. When using model structures,
we may prefer to work with compact sets $D_{\mathcal{M}}$. Clearly, as long as $D_{\mathcal{M}}$ is contained
in an open set where (4.121) are defined, no problems will occur. Differentiability
can also be defined over more complicated subsets of $\mathbf{R}^d$ than open ones, that is,
differentiable manifolds (see, e.g., Boothby, 1975). See the chapter bibliography for
further comments.
Example 4.5 An ARX Structure
Consider the ARX model

$$y(t) + ay(t-1) = b_1 u(t-1) + b_2 u(t-2) + e(t)$$

The predictor is given by (4.10), which means that

$$W(q,\theta) = \begin{bmatrix} b_1 q^{-1} + b_2 q^{-2} & -aq^{-1} \end{bmatrix}, \qquad \theta = \begin{bmatrix} a & b_1 & b_2 \end{bmatrix}^T$$

and

$$\Psi(q,\theta) = \begin{bmatrix} 0 & -q^{-1} \\ q^{-1} & 0 \\ q^{-2} & 0 \end{bmatrix}$$

□
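Applying $\Psi(q,\theta)$ of Example 4.5 to $z(t) = [u(t) \;\; y(t)]^T$ gives the gradient $\psi(t,\theta) = [-y(t-1) \;\; u(t-1) \;\; u(t-2)]^T$. The sketch below compares this analytic gradient with a central finite-difference approximation of the prediction. The function names and data values are hypothetical and serve only as a check.

```python
def arx_prediction(theta, y_prev, u_prev1, u_prev2):
    """One-step prediction -a y(t-1) + b1 u(t-1) + b2 u(t-2)."""
    a, b1, b2 = theta
    return -a * y_prev + b1 * u_prev1 + b2 * u_prev2

def arx_gradient(y_prev, u_prev1, u_prev2):
    """Analytic gradient Psi(q, theta) z(t) with respect to (a, b1, b2)."""
    return [-y_prev, u_prev1, u_prev2]

def numerical_gradient(theta, y_prev, u_prev1, u_prev2, h=1e-6):
    """Central finite differences of the prediction with respect to theta."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += h
        down[i] -= h
        diff = (arx_prediction(up, y_prev, u_prev1, u_prev2)
                - arx_prediction(down, y_prev, u_prev1, u_prev2))
        grad.append(diff / (2.0 * h))
    return grad

theta = [0.7, 1.2, -0.4]                     # illustrative (a, b1, b2)
y_prev, u_prev1, u_prev2 = 2.0, -1.0, 0.5    # illustrative past data
analytic = arx_gradient(y_prev, u_prev1, u_prev2)
numeric = numerical_gradient(theta, y_prev, u_prev1, u_prev2)
```

The gradient does not depend on $\theta$ at all, which reflects that the ARX predictor is a linear regression in its parameters.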
The parametrized model sets that we have explicitly studied in this chapter
have been in terms of (4.4), that is,

$$y(t) = G(q,\theta)u(t) + H(q,\theta)e(t), \qquad \theta \in D_{\mathcal{M}} \tag{4.123}$$

or using (4.111)

$$y(t) = T(q,\theta)\chi(t) \tag{4.124}$$

It is immediate to verify that, in view of (4.114),

$$\Psi(q,\theta) = \frac{1}{(H(q,\theta))^2}\,T'(q,\theta)\begin{bmatrix} H(q,\theta) & 0 \\ -G(q,\theta) & 1 \end{bmatrix} \tag{4.125}$$

where $T'(q,\theta)$ is the $d \times 2$ matrix

$$T'(q,\theta) = \frac{d}{d\theta}T(q,\theta) = \begin{bmatrix} \dfrac{d}{d\theta}G(q,\theta) & \dfrac{d}{d\theta}H(q,\theta) \end{bmatrix}$$

Differentiability of $W$ is thus assured by differentiability of $T$.


It should be clear that all parametrizations we have considered in this chapter
indeed are model structures in the sense of Definition 4.3. We have, for example:


Lemma 4.1. The parametrization (4.35) together with (4.41) with $\theta$ confined to
$D_{\mathcal{M}} = \{\theta \mid F(z) \cdot C(z) \text{ has no zeros on or outside the unit circle}\}$ is a model
structure.


Proof. We need only verify that the gradients of

$$W_u(z,\theta) = \frac{B(z)D(z)}{C(z)F(z)}$$

and

$$W_y(z,\theta) = 1 - \frac{D(z)A(z)}{C(z)}$$

with respect to $\theta$ are analytical in $|z| \ge 1$ for $\theta \in D_{\mathcal{M}}$. But this is immediate since,
for example,

$$\frac{\partial}{\partial c_k}W_u(z,\theta) = -\frac{B(z)D(z)z^{-k}}{[C(z)]^2 F(z)}$$

□

Lemma 4.2. Consider the state-space parametrization (4.91). Assume that the
entries of the matrices $A(\theta)$, $B(\theta)$, $K(\theta)$, and $C(\theta)$ are differentiable with respect
to $\theta$. Suppose that $\theta \in D_{\mathcal{M}}$, with

$$D_{\mathcal{M}} = \{\theta \mid \text{all eigenvalues of } A(\theta) - K(\theta)C(\theta) \text{ are inside the unit circle}\}$$

Then the parametrization of the corresponding predictor is a model structure.


Proof. See Problem 4D.1. □


Notice that when $K(\theta)$ is obtained as the solution of (4.87), then by a standard
Kalman filter property (see Anderson and Moore, 1979),

$$D_{\mathcal{M}} = \{\theta \mid [A(\theta), R_1(\theta)] \text{ stabilizable and } [A(\theta), C(\theta)] \text{ detectable}\} \tag{4.126}$$
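When relating different model structures, we shall use the following concept.

Definition 4.4. A model structure $\mathcal{M}_1$ is said to be contained in $\mathcal{M}_2$,

$$\mathcal{M}_1 \subset \mathcal{M}_2 \tag{4.127}$$

if $D_{\mathcal{M}_1} \subset D_{\mathcal{M}_2}$ and the mapping $\mathcal{M}_1$ is obtained by restricting $\mathcal{M}_2$ to $\theta \in D_{\mathcal{M}_1}$.

The archetypical situation for (4.127) is when $\mathcal{M}_2$ defines $n$th-order models and $\mathcal{M}_1$
defines $m$th-order models, $m < n$. One could think of $\mathcal{M}_1$ as obtained from $\mathcal{M}_2$ by
fixing some parameters (typically to zero).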


The following property of a model structure is sometimes useful:


Definition 4.5. A model structure $\mathcal{M}$ is said to have an independently parametrized
transfer function and noise model if

$$\theta = \begin{bmatrix} \rho \\ \eta \end{bmatrix}, \qquad T(q,\theta) = \begin{bmatrix} G(q,\rho) & H(q,\eta) \end{bmatrix} \tag{4.128}$$

We note that in the family (4.33) the special cases with $A(q) = 1$ correspond to
independent parametrizations of $G$ and $H$.
Remark on "Finite Model Structures": Sometimes the set of candidate
models is finite as in (4.120). It may still be desirable to index it using a parameter vector
$\theta$, now ranging over a finite set of points. Although such a construction does not
qualify as a "model structure" according to Definition 4.3, it should be noted that
the estimation procedures of Sections 7.1 to 7.4, as well as the convergence analysis
of Sections 8.1 to 8.5, still make sense in this case.


Model Set as a Range of a Model Structure 
A model structure will clearly define a model set by its range:

$$\mathcal{M}^* = \mathcal{R}(\mathcal{M}) = \text{Range } \mathcal{M} = \{\mathcal{M}(\theta) \mid \theta \in D_{\mathcal{M}}\}$$

An important problem for system identification is to find a model structure whose
range equals a given model set. This may sometimes be an easy problem and
sometimes highly nontrivial.


Example 4.6 Parametrizing M3 


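Consider the set $\mathcal{M}_3^*$ defined by (4.119) with $n = 3$. If we take

$$\theta = \begin{bmatrix} a_1 & a_2 & a_3 & b_1 & b_2 & b_3 \end{bmatrix}^T, \qquad d = 6$$

$$D_{\mathcal{M}} = \mathbf{R}^6$$

and

$$\begin{aligned} W_y(q,\theta) &= -a_1 q^{-1} - a_2 q^{-2} - a_3 q^{-3} \\ W_u(q,\theta) &= b_1 q^{-1} + b_2 q^{-2} + b_3 q^{-3} \end{aligned}$$

we have obviously constructed a model structure whose range equals $\mathcal{M}_3^*$. □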

A given model set can typically be described as the range of several different 
model structures (see Problems 4E.6 and 4E.9). 


Model Set as a Union of Ranges of Model Structures 


In the preceding example it was possible to describe the desired model set as the
range of a model structure. We shall later encounter model sets for which this is
not possible, at least not with model structures with desired identifiability properties.
The remedy for these problems is to describe the model set as a union of ranges of
different model structures:

$$\mathcal{M}^* = \bigcup_{i=1}^{\ell} \mathcal{R}(\mathcal{M}_i) \tag{4.129}$$
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This idea has been pursued in particular for representing linear multioutput systems. 
We shall give the details of this procedure in Appendix 4A. Let us here only remark 
that model sets described by (4.129) are useful also for working with models of 
different orders, and that they are often used. at least implicitly, when the order of a 
suitable model is unknown and is to be determined. 


Identifiability Properties


Identifiability is a concept that is central in identification problems. Loosely speaking,
the problem is whether the identification procedure will yield a unique value of the
parameter $\theta$, and/or whether the resulting model is equal to the true system. We shall
deal with the subject in more detail in the analysis chapter (see Sections 8.2 and 8.3).
The issue involves aspects on whether the data set (the experimental conditions) is
informative enough to distinguish between different models as well as properties of
the model structure itself: If the data are informative enough to distinguish between
nonequal models, then the question is whether different values of $\theta$ can give equal
models. With our terminology, the latter problem concerns the invertibility of the
model structure $\mathcal{M}$ (i.e., whether $\mathcal{M}$ is injective). We shall now discuss some concepts
related to such invertibility properties. Remember that these are only one leg of the
identifiability concept. They are to be complemented in Sections 8.2 and 8.3.


Definition 4.6. A model structure $\mathcal{M}$ is globally identifiable at $\theta^*$ if

$$\mathcal{M}(\theta) = \mathcal{M}(\theta^*),\ \theta \in D_{\mathcal{M}} \;\Rightarrow\; \theta = \theta^* \tag{4.130}$$


Recall that model equality was defined in (4.116), requiring the predictor transfer
functions to coincide. According to (4.115), this means that the underlying transfer
functions $G$ and $H$ coincide.


Once identifiability at a point is defined, we proceed to properties of the whole 
set. 


Definition 4.7. A model structure $\mathcal{M}$ is strictly globally identifiable if it is globally
identifiable at all $\theta^* \in D_{\mathcal{M}}$.


This definition is quite demanding. As we shall see, it is difficult to construct
model structures that are strictly globally identifiable. The difficulty for linear systems,
for example, is that global identifiability may be lost at points on hyper-surfaces
corresponding to lower-order systems. Therefore, we introduce a weaker and more
realistic property:


Definition 4.8. A model structure $\mathcal{M}$ is globally identifiable if it is globally
identifiable at almost all $\theta^* \in D_{\mathcal{M}}$.

Remark. This means that $\mathcal{M}$ is globally identifiable at all $\theta^* \in \bar{D}_{\mathcal{M}} \subset D_{\mathcal{M}}$,
where

$$\delta D_{\mathcal{M}} = \{\theta \mid \theta \in D_{\mathcal{M}};\ \theta \notin \bar{D}_{\mathcal{M}}\}$$

is a set of Lebesgue measure zero in $\mathbf{R}^d$ (recall that $D_{\mathcal{M}}$, and hence $\delta D_{\mathcal{M}}$, is a subset
of $\mathbf{R}^d$).


For corresponding local properties, the most natural definition of local identifiability
of $\mathcal{M}$ at $\theta^*$ would be to require that there exists an $\varepsilon$ such that

$$\mathcal{M}(\theta) = \mathcal{M}(\theta^*),\ \theta \in \mathcal{B}(\theta^*, \varepsilon) \;\Rightarrow\; \theta = \theta^* \tag{4.131}$$

where $\mathcal{B}(\theta^*, \varepsilon)$ denotes an $\varepsilon$-neighborhood of $\theta^*$.


(Strict) local identifiability of a model structure can then be defined analogously
to Definitions 4.7 and 4.8. See also Problem 4G.4.


Use of the Identifiability Concept 


The identifiability concept concerns the unique representation of a given system
description in a model structure. Let

$$\mathcal{S}: \quad y(t) = G_0(q)u(t) + H_0(q)e(t) \tag{4.132}$$

be such a description. We could think of it as a “true” or “ideal” description of the
actual system, but such an interpretation is immaterial for the moment. Let $\mathcal{M}$ be a
model structure based on one-step-ahead predictors for

$$y(t) = G(q,\theta)u(t) + H(q,\theta)e(t) \tag{4.133}$$

Then define the set $D_T(\mathcal{S}, \mathcal{M})$ as those $\theta$-values in $D_{\mathcal{M}}$ for which $\mathcal{S} = \mathcal{M}(\theta)$. We
can write this as

$$D_T(\mathcal{S}, \mathcal{M}) = \{\theta \in D_{\mathcal{M}} \mid G_0(z) = G(z,\theta),\ H_0(z) = H(z,\theta) \text{ almost all } z\} \tag{4.134}$$

This set is empty in case $\mathcal{S} \notin \mathcal{M}$. (Here, with abuse of notation, $\mathcal{M}$ also denotes the
range of the mapping $\mathcal{M}$.)

Now suppose that $\mathcal{S} \in \mathcal{M}$ so that $\mathcal{S} = \mathcal{M}(\theta_0)$ for some value $\theta_0$. Furthermore,
suppose that $\mathcal{M}$ is globally identifiable at $\theta_0$. Then

$$D_T(\mathcal{S}, \mathcal{M}) = \{\theta_0\} \tag{4.135}$$


One aspect of the choice of a good model structure is to select $\mathcal{M}$ so that (4.135)
holds for the given description $\mathcal{S}$. Since $\mathcal{S}$ is unknown to the user, this will typically
involve tests of several different structures $\mathcal{M}$. The identifiability concepts will then
provide useful guidance in finding an $\mathcal{M}$ such that (4.135) holds.


4.6 IDENTIFIABILITY OF SOME MODEL STRUCTURES


Definition 4.6 and (4.116) together imply that a model structure is globally identifiable
at $\theta^*$ if and only if

$$G(z,\theta) = G(z,\theta^*) \text{ and } H(z,\theta) = H(z,\theta^*) \text{ for almost all } z \;\Rightarrow\; \theta = \theta^* \tag{4.136}$$

For local identifiability, we consider only $\theta$ confined to a sufficiently small neighborhood
of $\theta^*$. A general approach to test local identifiability is given by the criterion
in Problem 4G.4.


Global identifiability is more difficult to deal with in general terms. In this 
section we shall only briefly discuss identifiability of physical parameters and give 
some results for general black-box SISO models. Black-box multivariable systems 
are dealt with in Appendix 4A. 


Parametrizations in Terms of Physical Parameters 


Modeling physical processes typically leads to a continuous-time state-space model 
(4.62) to (4.63), summarized as (4.65) (T = 1): 


$$y(t) = G_c(p,\theta)u(t) + v(t) \tag{4.137}$$

For proper handling we should sample $G_c$ and include a noise model $H$ so that
(4.136) can be applied for identifiability tests. A simpler test to apply is

$$G_c(s,\theta) = G_c(s,\theta^*) \text{ almost all } s \;\Rightarrow\; \theta = \theta^*\,? \tag{4.138}$$

It is true that this is not identical to (4.136): When sampling $G_c$, ambiguities may
occur; two different $G_c$ can give the same $G_T$ [cf. (2.24)]. Equation (4.138) is thus
not sufficient for (4.136) to hold. However, with a carefully selected sampling interval,
this ambiguity should not cause any problems. Also, a $\theta$-parametrized noise
model may help in resolving (4.138). This condition is thus not necessary for (4.136)
to hold. However, in most applications the noise characteristics are not so significant
that they indeed bear information about the physical parameters. All this means
that (4.138) is a reasonable test for global identifiability of the corresponding model
structure at $\theta^*$.
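
As a small worked illustration of the test (4.138), consider a hypothetical first-order
model (not one taken from the text) in which two physical gains enter only through
their product. Assume $\theta_1^*\theta_2^* \neq 0$:

```latex
G_c(s,\theta) = \frac{\theta_1\theta_2}{s+\theta_3},
\qquad \theta = \begin{bmatrix}\theta_1 & \theta_2 & \theta_3\end{bmatrix}^T

% Equality for almost all s forces equal poles and equal static gains:
G_c(s,\theta) = G_c(s,\theta^*) \ \text{almost all } s
\iff \theta_3 = \theta_3^*, \quad \theta_1\theta_2 = \theta_1^*\theta_2^*

% Hence every point on the curve below gives the same transfer function:
\theta = \begin{bmatrix} c\,\theta_1^* & \theta_2^*/c & \theta_3^* \end{bmatrix}^T,
\qquad c \neq 0
```

Since $c$ can be taken arbitrarily close to 1, this sketch structure fails the test
both globally and locally. Reparametrizing with the product $\kappa = \theta_1\theta_2$ as a
single parameter would give a structure that passes it.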


Now, (4.138) is a difficult enough problem. Except for special structures there
are no general techniques available other than brute-force solution of the equations
underlying (4.138). See Problems 4E.5 and 4E.6 for some examples. A comprehensive
treatment of (4.138) for state-space models is given by Walter (1982), and
Godfrey (1983) discusses the same problem for compartmental models. See also
Godfrey and DiStefano (1985). A general approach based on differential algebra is
described in Ljung and Glad (1994b).


SISO Transfer-function Model Structures


We shall now aim at an analysis of the general black-box SISO model structure
(4.33) together with (4.41). Let us first illustrate the character of the analysis with
two simple special cases.


Consider the ARX model structure (4.7) together with (4.9):

$$G(z,\theta) = \frac{B(z)}{A(z)}, \qquad H(z,\theta) = \frac{1}{A(z)}, \qquad \theta = \begin{bmatrix} a_1 & \ldots & a_{n_a} & b_1 & \ldots & b_{n_b} \end{bmatrix}^T \tag{4.139}$$

Equality for $H$ in (4.136) implies that the $A$-polynomials coincide, which in turn implies
that the $B$-polynomials must coincide for the $G$ to be equal. It is thus immediate
to verify that (4.136) holds for all $\theta^*$ in the model structure (4.139). Consequently,
the structure (4.139) is strictly globally identifiable.

Let us now turn to the OE model structure (4.25) with orders $n_b$ and $n_f$. At
$\theta = \theta^*$ we have

$$G(z,\theta^*) = \frac{B^*(z)}{F^*(z)} = \frac{b_1^* z^{-1} + \cdots + b_{n_b}^* z^{-n_b}}{1 + f_1^* z^{-1} + \cdots + f_{n_f}^* z^{-n_f}} = z^{n_f - n_b}\,\frac{b_1^* z^{n_b - 1} + \cdots + b_{n_b}^*}{z^{n_f} + f_1^* z^{n_f - 1} + \cdots + f_{n_f}^*} = z^{n_f - n_b}\,\frac{\tilde{B}^*(z)}{\tilde{F}^*(z)} \tag{4.140}$$

We shall work with the polynomial $\tilde{F}^*(z) = z^{n_f} F^*(z)$ in the variable $z$, rather than
with $F^*(z)$, which is a polynomial in $z^{-1}$. The reason is that $z^{n_f} F^*(z)$ always has
degree $n_f$ regardless of whether $f_{n_f}^*$ is zero. Let $\tilde{B}^*(z) = z^{n_b} B^*(z)$, and let $\theta$ be an
arbitrary parameter value. We can then write (4.136),

$$G(z,\theta^*) = G(z,\theta): \qquad z^{n_f - n_b}\,\frac{\tilde{B}^*(z)}{\tilde{F}^*(z)} = z^{n_f - n_b}\,\frac{\tilde{B}(z)}{\tilde{F}(z)}$$

as

$$\tilde{F}(z)\tilde{B}^*(z) - \tilde{F}^*(z)\tilde{B}(z) = 0 \tag{4.141}$$

Since $\tilde{F}^*(z)$ is a polynomial of degree $n_f$, it has $n_f$ zeros:

$$\tilde{F}^*(\alpha_i) = 0, \qquad i = 1, \ldots, n_f$$

Suppose that $\tilde{B}^*(\alpha_i) \neq 0$, $i = 1, \ldots, n_f$; that is, $\tilde{B}^*(z)$ and $\tilde{F}^*(z)$ are coprime
(have no common factors). Then (4.141) implies that

$$\tilde{F}(\alpha_i) = 0, \qquad i = 1, \ldots, n_f$$

[if a zero $\alpha_i$ has multiplicity $n_i$, then differentiate (4.141) $n_i - 1$ times to conclude that
it is a zero of the same multiplicity to $\tilde{F}(z)$]. Consequently, we have $\tilde{F}(z) = \tilde{F}^*(z)$,
which in turn implies that $\tilde{B}(z) = \tilde{B}^*(z)$ so that $\theta = \theta^*$. If, on the other hand, $\tilde{F}^*$
and $\tilde{B}^*$ do have a common factor so that

$$\tilde{F}^*(z) = \gamma(z)\tilde{F}_1^*(z), \qquad \tilde{B}^*(z) = \gamma(z)\tilde{B}_1^*(z)$$

then all $\theta$ such that

$$\tilde{F}(z) = \beta(z)\tilde{F}_1^*(z), \qquad \tilde{B}(z) = \beta(z)\tilde{B}_1^*(z)$$

for arbitrary $\beta(z)$ will yield equality in (4.141). Hence the model structure is neither
globally nor locally identifiable at $\theta^*$ [$\beta(z)$ can be chosen arbitrarily close to $\gamma(z)$].
We thus find that the OE structure (4.25) is globally and locally identifiable at $\theta^*$
if and only if the corresponding numerator and denominator polynomials $z^{n_f} F^*(z)$
and $z^{n_b} B^*(z)$ are coprime.
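
The coprimeness condition for the OE structure can be checked mechanically. The
following is a minimal sketch, assuming exact rational coefficients and Euclid's
algorithm for the polynomial greatest common divisor; the helper names (`strip`,
`poly_rem`, `poly_gcd`, `oe_identifiable`, `G`) and the numerical values are
illustrative assumptions, not taken from the text. Polynomials in $z$ are stored as
coefficient lists with the highest power first.

```python
from fractions import Fraction


def strip(p):
    """Drop leading zero coefficients (coefficients are highest power first)."""
    i = 0
    while i < len(p) and p[i] == 0:
        i += 1
    return p[i:]


def poly_rem(num, den):
    """Remainder of the polynomial division num / den (den must be nonzero)."""
    num, den = strip(list(num)), strip(list(den))
    while len(num) >= len(den):
        factor = num[0] / den[0]
        for i in range(len(den)):
            num[i] -= factor * den[i]
        num = strip(num[1:])  # the leading term has been cancelled
    return num


def poly_gcd(p, q):
    """Greatest common divisor by Euclid's algorithm (up to a constant factor)."""
    p, q = strip(list(p)), strip(list(q))
    while q:
        p, q = q, poly_rem(p, q)
    return p


def oe_identifiable(b, f):
    """True iff z^nb B*(z) and z^nf F*(z) are coprime; b = [b1..], f = [f1..]."""
    b_tilde = [Fraction(x) for x in b]                  # b1 z^(nb-1) + ... + b_nb
    f_tilde = [Fraction(1)] + [Fraction(x) for x in f]  # z^nf + f1 z^(nf-1) + ...
    return len(poly_gcd(f_tilde, b_tilde)) == 1


def G(b, f, z):
    """Evaluate G(z) = B(z)/F(z), with B and F polynomials in 1/z."""
    num = sum(Fraction(bk) * z ** -(k + 1) for k, bk in enumerate(b))
    den = 1 + sum(Fraction(fk) * z ** -(k + 1) for k, fk in enumerate(f))
    return num / den


# Coprime case: (z + 0.5) against z^2 - 1.5 z + 0.7.
coprime_case = oe_identifiable(["1", "0.5"], ["-1.5", "0.7"])
# Common factor gamma(z) = z + 0.5: z^2 - 0.3 z - 0.4 = (z + 0.5)(z - 0.8).
b_star, f_star = ["1", "0.5"], ["-0.3", "-0.4"]
common_factor_case = oe_identifiable(b_star, f_star)
# Replacing gamma(z) by beta(z) = z - 0.3 gives a different theta, the same G.
b_alt, f_alt = ["1", "-0.3"], ["-1.1", "0.24"]
same_model = G(b_star, f_star, Fraction(2)) == G(b_alt, f_alt, Fraction(2))
```

The last three lines mirror the argument in the text: when a common factor is present,
a different parameter vector reproduces the same transfer function.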


The generalization to the black-box SISO structure (4.33) is now straightforward:


Theorem 4.1. Consider the model structure $\mathcal{M}$ corresponding to

$$A(q)y(t) = \frac{B(q)}{F(q)}u(t) + \frac{C(q)}{D(q)}e(t) \tag{4.142}$$

with $\theta$, given by (4.41), being the coefficients of the polynomials involved. The
degrees of the polynomials are $n_a$, $n_b$, and so on. This model structure is globally
identifiable at $\theta^*$ if and only if all of (i) to (vi) hold:
i. There is no common factor to all $z^{n_a} A^*(z)$, $z^{n_b} B^*(z)$, and $z^{n_c} C^*(z)$.
ii. There is no common factor to $z^{n_b} B^*(z)$ and $z^{n_f} F^*(z)$.
iii. There is no common factor to $z^{n_c} C^*(z)$ and $z^{n_d} D^*(z)$.
iv. If $n_a \geq 1$, then there must be no common factor to $z^{n_f} F^*(z)$ and $z^{n_d} D^*(z)$.
v. If $n_d \geq 1$, then there must be no common factor to $z^{n_a} A^*(z)$ and $z^{n_b} B^*(z)$.
vi. If $n_f \geq 1$, then there must be no common factor to $z^{n_a} A^*(z)$ and $z^{n_c} C^*(z)$.
The starred polynomials correspond to $\theta^*$.
Notice that several of the conditions (i) to (vi) will be automatically satisfied
in the common special cases of (4.142). Notice also that any of the conditions (i) to
(vi) can be violated only for “special” $\theta^*$, placed on hyper-surfaces in $\mathbf{R}^d$. We thus
have the following corollary:


Corollary. The model structure given by (4.142) is globally identifiable.


Looking for a “True” System Within Identifiable Structures


We shall now illustrate the usefulness of Theorem 4.1 by applying it to the problem
of finding an $\mathcal{M}$ such that (4.135) holds for a given $\mathcal{S}$. Suppose that $\mathcal{S}$ is given by

$$\mathcal{S}: \quad y(t) = \frac{B_0(q)}{A_0(q)F_0(q)}u(t) + \frac{C_0(q)}{A_0(q)D_0(q)}e(t) \tag{4.143}$$

with orders $n_a^0$, $n_b^0$, and so on (after all possible cancellations of common factors).


This system belongs to the model structure $\mathcal{M}$ in (4.142) provided all the model
orders are at least as large as the true ones:

$$n_a \geq n_a^0, \quad n_b \geq n_b^0, \quad \text{etc.} \tag{4.144}$$

When (4.144) holds, let $\theta_0$ be a value that gives the description (4.143):

$$\mathcal{S} = \mathcal{M}(\theta_0) \tag{4.145}$$

Now, clearly, $\mathcal{M}$ will be globally identifiable at $\theta_0$ and (4.135) will hold if we have
equality in all of (4.144). The true orders $n_a^0, \ldots$ are, however, typically not known,
and it would be quite laborious to search for all combinations of model orders until
equalities in (4.144) were obtained. The point of Theorem 4.1 is that such a search is
not necessary: the structure $\mathcal{M}$ is globally identifiable at $\theta_0$ under weaker conditions.


We have the following reformulation of Theorem 4.1: 


Theorem 4.2. Consider the system description $\mathcal{S}$ in (4.143) with true polynomial
orders $n_a^0$, $n_b^0$, and so on, as defined in the text. Consider model structure $\mathcal{M}$ of
Theorem 4.1. Then $\mathcal{S} \in \mathcal{M}$ and corresponds to a globally identifiable $\theta$-value if and
only if
i. $\min(n_a - n_a^0,\ n_b - n_b^0,\ n_c - n_c^0) = 0$.
ii. $\min(n_b - n_b^0,\ n_f - n_f^0) = 0$.
iii. $\min(n_c - n_c^0,\ n_d - n_d^0) = 0$.
iv. If $n_a \geq 1$, then also $\min(n_f - n_f^0,\ n_d - n_d^0) = 0$.
v. If $n_d \geq 1$, then also $\min(n_a - n_a^0,\ n_b - n_b^0) = 0$.
vi. If $n_f \geq 1$, then also $\min(n_a - n_a^0,\ n_c - n_c^0) = 0$.


With Theorem 4.2, the search for a true system within identifiable model structures
is simplified. If, for example, $\mathcal{S}$ can be described in ARMAX form with finite
orders $n_a^0$, $n_b^0$, and $n_c^0$, then we may take $n_a = n_b = n_c = n$ ($n_f = n_d = 0$) in $\mathcal{M}$,
giving a model structure, say, $\mathcal{M}_n$. By increasing $n$ one unit at a time, we will sooner
or later strike a structure where (i) holds and thus $\mathcal{S}$ can be uniquely represented.
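
The order conditions of Theorem 4.2 are easy to evaluate once candidate and true
orders are written down. The sketch below is illustrative only: the function name
`uniquely_represented`, the dictionary layout, and the "true" orders are assumptions
made for the example (in practice the true orders are unknown, which is the point of
the search).

```python
def uniquely_represented(n, n0):
    """Check conditions (i)-(vi) of Theorem 4.2.

    n and n0 map 'a', 'b', 'c', 'd', 'f' to the model orders and the true orders.
    """
    d = {k: n[k] - n0[k] for k in "abcdf"}          # order surplus per polynomial
    ok = min(d["a"], d["b"], d["c"]) == 0           # (i)
    ok = ok and min(d["b"], d["f"]) == 0            # (ii)
    ok = ok and min(d["c"], d["d"]) == 0            # (iii)
    if n["a"] >= 1:
        ok = ok and min(d["f"], d["d"]) == 0        # (iv)
    if n["d"] >= 1:
        ok = ok and min(d["a"], d["b"]) == 0        # (v)
    if n["f"] >= 1:
        ok = ok and min(d["a"], d["c"]) == 0        # (vi)
    return ok


def armax_orders(n):
    """ARMAX structure M_n: n_a = n_b = n_c = n, n_d = n_f = 0."""
    return {"a": n, "b": n, "c": n, "d": 0, "f": 0}


# Hypothetical true ARMAX orders, used only to exercise the search.
true_orders = {"a": 2, "b": 1, "c": 3, "d": 0, "f": 0}

n = 1
while not uniquely_represented(armax_orders(n), true_orders):
    n += 1  # increase n one unit at a time
```

For these assumed orders the loop stops at the largest of the three true orders;
increasing `n` further violates (i) again, since all three polynomials could then
share a common factor.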




SISO State-space Models 


Consider now a state-space model structure (4.91). It is quite clear that the matrices
$A(\theta)$, $B(\theta)$, $C(\theta)$, and $K(\theta)$ cannot be “filled” with parameters, since the
corresponding input-output description (4.92) is defined by $3n$ parameters only
($n = \dim x$). To obtain identifiable structures, it is thus natural to seek parametrizations
of the matrices that involve $3n$ parameters; the coefficients of the two $(n-1)$th
order numerator polynomials and the coefficients of the common, monic $n$th order
denominator polynomial or some transformation of these coefficients. One such
parametrization is the observer canonical form of Example 4.2, which we can write
in symbolic form as

$$x(t+1,\theta) = A(\theta)x(t,\theta) + B(\theta)u(t) + K(\theta)e(t)$$
$$y(t) = C(\theta)x(t,\theta) + e(t) \tag{4.146a}$$

$$A(\theta) = \left[\begin{array}{c|c} \times & \\ \vdots & I_{n-1} \\ \times & \\ \hline \times & 0 \;\cdots\; 0 \end{array}\right], \qquad B(\theta) = \begin{bmatrix} \times \\ \times \\ \vdots \\ \times \end{bmatrix}, \qquad K(\theta) = \begin{bmatrix} \times \\ \times \\ \vdots \\ \times \end{bmatrix} \tag{4.146b}$$

$$C(\theta) = \begin{bmatrix} 1 & 0 & \cdots & 0 \end{bmatrix}$$

Here $I_{n-1}$ is the $(n-1) \times (n-1)$ unit matrix, while $\times$ marks an adjustable parameter.
This representation is observable by construction.

According to Example 4.2, this structure is in one-to-one correspondence with
an ARMAX structure with $n_a = n_b = n_c = n$. From Theorem 4.1 we know that
this is identifiable at $\theta^*$, provided the corresponding polynomials do not all have a
common factor, meaning that the model could be represented using a smaller value
of $n$. It is well known that for state-space models this can only happen if the model
is uncontrollable and/or unobservable. Since (4.146) is observable by construction,
we thus conclude that this structure is globally and locally identifiable at $\theta^*$ if and
only if the two-input system $\{A(\theta^*), [\,B(\theta^*)\ \ K(\theta^*)\,]\}$ is controllable. Note that
this result applies to the particular state-space structure (4.146) only.
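
The controllability condition can be tested numerically by a rank check. The sketch
below makes two assumptions that are not spelled out in this passage: the free
first-column entries of $A(\theta)$ are taken as $-a_i$ (the usual observer canonical
form), and the gain entries are obtained as $k_i = c_i - a_i$, which follows from
$H(q) = 1 + C(qI - A)^{-1}K$ for this form. The helper names and the numerical
values are illustrative.

```python
from fractions import Fraction


def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def rank(rows):
    """Rank by Gaussian elimination; exact when the entries are Fractions."""
    m = [list(r) for r in rows]
    r = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [x - factor * y for x, y in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


def is_identifiable(a, b, c):
    """Controllability of {A, [B K]} for the observer canonical form.

    a, b, c hold the ARMAX coefficients a_1..a_n, b_1..b_n, c_1..c_n.
    """
    n = len(a)
    a, b, c = ([Fraction(x) for x in v] for v in (a, b, c))
    # Row i of A: -a_i in the first column, then row i of [I_(n-1); 0 ... 0].
    A = [[-a[i]] + [Fraction(int(j == i)) for j in range(n - 1)] for i in range(n)]
    M = [[b[i], c[i] - a[i]] for i in range(n)]      # [B K] with k_i = c_i - a_i
    ctrb = [list(row) for row in M]
    block = M
    for _ in range(n - 1):                           # append A M, A^2 M, ...
        block = matmul(A, block)
        ctrb = [row + extra for row, extra in zip(ctrb, block)]
    return rank(ctrb) == n


# A = (1 - 0.5/q)(1 - 0.2/q), C = (1 - 0.5/q)(1 + 0.3/q) in both cases.
a, c = ["-0.7", "0.1"], ["-0.2", "-0.15"]
shared = is_identifiable(a, ["1", "-0.5"], c)   # B = (1/q)(1 - 0.5/q): common factor
coprime = is_identifiable(a, ["1", "0.3"], c)   # B = (1/q)(1 + 0.3/q): no common factor
```

When all three polynomials share the factor $1 - 0.5q^{-1}$ the pair loses
controllability, in agreement with the statement above.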


4.7 SUMMARY 
In this chapter we have studied sets of predictors of the type

$$\hat{y}(t|\theta) = W_u(q,\theta)u(t) + W_y(q,\theta)y(t), \qquad \theta \in D_{\mathcal{M}} \subset \mathbf{R}^d \tag{4.147}$$

These are in one-to-one correspondence with model descriptions

$$y(t) = G(q,\theta)u(t) + H(q,\theta)e(t), \qquad \theta \in D_{\mathcal{M}} \tag{4.148}$$

with $\{e(t)\}$ as white noise, via

$$W_u(q,\theta) = H^{-1}(q,\theta)G(q,\theta)$$
$$W_y(q,\theta) = [1 - H^{-1}(q,\theta)]$$

When choosing models it is usually most convenient to go via (4.148), even if (4.147)
is the “operational” version.
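
As a short worked instance of this correspondence (a sketch for two standard special
cases of the chapter's polynomial models), inserting $G$ and $H$ for the ARX and ARMAX
structures into the two expressions for $W_u$ and $W_y$ gives:

```latex
% ARX: G = B/A, H = 1/A
W_u(q,\theta) = A(q)\,\frac{B(q)}{A(q)} = B(q), \qquad
W_y(q,\theta) = 1 - A(q)

% ARMAX: G = B/A, H = C/A
W_u(q,\theta) = \frac{A(q)}{C(q)}\,\frac{B(q)}{A(q)} = \frac{B(q)}{C(q)}, \qquad
W_y(q,\theta) = 1 - \frac{A(q)}{C(q)}
```

In the ARX case both predictor filters are finite polynomials in $q^{-1}$, whereas in
the ARMAX case they are rational with denominator $C(q)$.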

We have denoted parametrized model sets, or model structures, by $\mathcal{M}$, while a
particular model corresponding to the parameter value $\theta$ is denoted by $\mathcal{M}(\theta)$. Such
a parametrization is instrumental in conducting a search for “best models.” Two
different philosophies may guide the choice of parametrized model sets:


1. Black-box model structures: The prime idea is to obtain flexible model sets 
that can accommodate a variety of systems, without looking into their internal 
structures. The input-output model structures of Section 4.2, as well as canoni- 
cally parametrized state-space models (see Example 4.2), are of this character. 


2. Model structures with physical parameters: The idea is to incorporate physical 
insight into the model set so as to bring the number of adjustable parameters 
down to what is actually unknown about the system. Continuous-time state- 
space models are typical representatives for this approach. 


We have also in this chapter introduced formal requirements on the predictor filters
$W_u(q,\theta)$ and $W_y(q,\theta)$ (Definition 4.3) and discussed concepts of parameter
identifiability (i.e., whether the parameter $\theta$ can be uniquely determined from the
predictor filters). These properties were investigated for the most typical black-box
model structures in Section 4.6 and Appendix 4A. The bottom line of these results is
that identifiability can be secured, provided certain orders are chosen properly. The
number of such orders to be chosen typically equals the number of outputs.


4.8 BIBLIOGRAPHY 


The selection of a parameterized set of models is, as we have noted, vital for the
identification problem. This is the link between system identification and parameter
estimation techniques. Most articles and books on system identification thus contain
material on model structures, even if not presented in as explicit terms as here.

The simple equation error model (4.7) has been widely studied in many contexts.
See, for example, Åström (1968), Hsia (1977), Mendel (1973), and Unbehauen,
Göhring, and Bauer (1974) for discussions related to identification. Linear models
like (4.12) are prime objects of study in statistics; see, for example, Rao (1973) or
Draper and Smith (1981). The ARMAX model was introduced into system identification
in Åström and Bohlin (1965) and has since then been a basic model. The ARARX
model structure was introduced into the control literature by Clarke (1967), but
was apparently first used in a statistical framework by Cochrane and Orcutt (1949).
The term pseudo-linear regression for the representation (4.21) was introduced by
Solo (1978). Output error models are treated, for example, in Dugard and Landau
(1980) and Kabaila and Goodwin (1980). The general family (4.33) was first discussed
in Ljung (1979). It was used in Ljung and Söderström (1983). Multivariable MFDs
are discussed in Kailath (1980). When no input is present, the corresponding model
structures reduce to AR, MA, and ARMA descriptions. These are discussed in many
textbooks on time series, e.g., Box and Jenkins (1970), Hannan (1970), and Brillinger
(1981).

Black-box continuous transfer function models of the type (4.50) have been
used in many cases oriented toward control applications. Ziegler and Nichols (1942)
determine parameters in such models from step responses and self-oscillatory modes
(see Section 6.1).

State-space models in innovations forms as well as the general forms are treated
in standard textbooks on control (e.g., Åström and Wittenmark, 1984). The use
of continuous-time representations for estimation using discrete data has been discussed,
for example, in Mehra and Tyler (1973) and Åström and Källström (1976).
The continuous-time model structure is usually arrived at after an initial modeling
step. See, for example, Wellstead (1979), Nicholson (1981), Ljung and Glad
(1994a), and Cellier (1990) for general modeling techniques and examples. Direct
identification of continuous-time systems is discussed in Unbehauen and Rao (1987).

Distributed parameter models and their estimation are treated in, for example,
Banks, Crowley, and Kunisch (1983), Kubrusly (1977), Qureshi, Ng, and Goodwin
(1980), and Polis and Goodson (1976). Example 4.3 is studied experimentally in
Leden, Hamza, and Sheirah (1976).

The prediction aspect of models was emphasized in Ljung (1974) and Ljung
(1978). Identifiability is discussed in many contexts. A survey is given in Nguyen
and Wood (1982). Often identifiability is related to convergence of the parameter
estimates. Such definitions are given in Åström and Bohlin (1965), Staley and Yue
(1970), and Tse and Anton (1972). Identifiability definitions in terms of the model
structure only were introduced by Bellman and Åström (1970), who called it “structural
identifiability.” Identifiability definitions in terms of the set $D_T(\mathcal{S}, \mathcal{M})$ [defined
by (4.134)] were given in Gustavsson, Ljung, and Söderström (1977). The particular
definitions of the concept of model structure and identifiability given in Section 4.5
are novel. In Ljung and Glad (1994b) identifiability is treated from an algebraic
perspective. It is shown that any globally identifiable structure can be rearranged as a
linear regression.

A more general model structure concept than Definition 4.3 would be to let $D_{\mathcal{M}}$
be a differentiable manifold (see, e.g., Byrnes, 1976). However, in our treatment that
possibility is captured by letting a model set be described as a union of (overlapping)
ranges of model structures as in (4.129). This manifold structure for linear systems
was first described by Kalman (1974), Hazewinkel and Kalman (1976), and Clark
(1976).

The identifiability of multivariable model structures has been dealt with in
numerous articles. See, for example, Kailath (1980), Luenberger (1967), Glover
and Willems (1974), Rissanen (1974), Ljung and Rissanen (1976), Guidorzi (1981),
Gevers and Wertz (1984), Van Overbeek and Ljung (1982), and Correa and Glover
(1984).

In addition to the parameterizations described in the appendix, approaches
based on balanced realizations are described in Maciejowski (1985), Ober (1987),
and Hanzon and Ober (1997). Parameterizations that are not identifiable, but may
still have numerical advantages, are discussed by McKelvey (1994) and McKelvey
and Helmersson (1996).


4G.1 Consider the predictor (4.18). Show that the effect from an erroneous initial condition
in $\hat{y}(s|\theta)$, $s \leq 0$, is bounded by $c \cdot \mu^t$, where $\mu$ is the maximum magnitude of the zeros
of $C(z)$.


4G.2 Colored measurement noise: Suppose that a state-space representation is given as

$$x(t+1) = A_1(\theta)x(t) + B_1(\theta)u(t) + w_1(t)$$
$$y(t) = C_1(\theta)x(t) + v(t) \tag{4.149}$$

where $\{w_1(t)\}$ is white with variance $R_1(\theta)$, but the measurement noise $\{v(t)\}$ is not
white. A model for $v(t)$ can, however, be given as

$$v(t) = H(q,\theta)\nu(t) \tag{4.150}$$

with $\{\nu(t)\}$ being white noise with variance $R_2(\theta)$ and $H(q,\theta)$ monic. Introduce a
state-space representation for (4.150):

$$\xi(t+1) = A_2(\theta)\xi(t) + K_2(\theta)\nu(t)$$
$$v(t) = C_2(\theta)\xi(t) + \nu(t) \tag{4.151}$$

Combine (4.149) and (4.150) into a single representation that complies with the structure
(4.84) to (4.85). Determine $R_1(\theta)$, $R_{12}(\theta)$, and $R_2(\theta)$. Note that if $w_1(t)$ is zero then
the new representation will be directly in the innovations form (4.91).


4G.3 Verification of the Steady-State Kalman Filter: The state-space model (4.84) can be
written (suppressing the argument $\theta$ and assuming $\dim y = 1$)

$$y(t) = G(q)u(t) + v_1(t)$$

where

$$G(q) = C(qI - A)^{-1}B$$
$$v_1(t) = C(qI - A)^{-1}w(t) + v(t)$$

Let $R_{12} = 0$. The spectrum of $\{v_1(t)\}$ then is

$$\Phi_1(\omega) = C(e^{i\omega}I - A)^{-1}R_1(e^{-i\omega}I - A^T)^{-1}C^T + R_2$$

using Theorem 2.2. The innovations model (4.91) can be written

$$y(t) = G(q)u(t) + v_2(t)$$
$$v_2(t) = H(q)e(t), \qquad H(q) = C(qI - A)^{-1}K + 1$$

The spectrum of $\{v_2(t)\}$ thus is

$$\Phi_2(\omega) = \lambda\,[C(e^{i\omega}I - A)^{-1}K + 1][C(e^{-i\omega}I - A)^{-1}K + 1]^T$$

where $\lambda$ is the variance of $e(t)$.

(a) Show by direct calculation that

$$\Phi_1(\omega) - \Phi_2(\omega) \equiv 0$$

utilizing the expressions (4.87) and (4.91b). The two representations thus have
the same second-order properties, and if the noises are Gaussian, they are
indistinguishable in practice (see Problem 2E.3).

(b) Show by direct calculation that

$$1 - H^{-1}(q) = 1 - [1 + C(qI - A)^{-1}K]^{-1} = C(qI - A + KC)^{-1}K$$

and

$$H^{-1}(q)G(q) = [1 + C(qI - A)^{-1}K]^{-1}C(qI - A)^{-1}B = C(qI - A + KC)^{-1}B$$

(c) Note that the predictor (4.86) can be written as (4.88):

$$\hat{y}(t|\theta) = C(qI - A + KC)^{-1}Bu(t) + C(qI - A + KC)^{-1}Ky(t)$$

and thus that (a) and (b) together with (3.20) constitute a derivation of the
steady-state Kalman filter.


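4G.4 Consider a model structure $\mathcal{M}$, with predictor function gradient $\Psi(z,\theta)$ defined in
(4.121). Define the $d \times d$ matrix

$$\Gamma_1(\theta) = \int_{-\pi}^{\pi} \Psi(e^{i\omega},\theta)\,\Psi^T(e^{-i\omega},\theta)\,d\omega$$

(a) Show that $\mathcal{M}$ is locally identifiable at $\theta$ if $\Gamma_1(\theta)$ is nonsingular.
(b) Let $T'(z,\theta)$ be defined by (4.125), and let

$$\Gamma_2(\theta) = \int_{-\pi}^{\pi} T'(e^{i\omega},\theta)\,[T'(e^{-i\omega},\theta)]^T\,d\omega$$

Use (4.124) to show that $\Gamma_2(\theta)$ is nonsingular if and only if $\Gamma_1(\theta)$ is. [Note that
by assumption $H(q)$ has no zeros on the unit circle.] $\Gamma_2(\theta)$ can thus be used to
test local identifiability.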


4G.5 Consider an output error structure with several inputs

$$y(t) = \frac{B_1(q)}{F(q)}u_1(t) + \cdots + \frac{B_m(q)}{F(q)}u_m(t) + e(t)$$

Show that this structure is globally identifiable at a value $\theta^*$ if and only if there is no
common factor to all of the $m + 1$ polynomials

$$z^{n_f} F^*(z), \qquad z^{n_b} B_i^*(z), \quad i = 1, \ldots, m$$
$$n_f = \text{degree } F^*(z), \qquad n_b = \max_i \text{ degree } B_i^*(z)$$

$\theta^*$ here corresponds to the starred polynomials.


4G.6 The Kronecker product of an $m \times n$ matrix $A = (a_{ij})$ and a $p \times r$ matrix $B = (b_{ij})$
is defined as (see, e.g., Barnett, 1975)

$$A \otimes B = \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1n}B \\ a_{21}B & a_{22}B & \cdots & a_{2n}B \\ \vdots & & & \vdots \\ a_{m1}B & a_{m2}B & \cdots & a_{mn}B \end{bmatrix}$$

This is an $mp \times nr$ matrix. Define the operator “col” as the operation to form a column
vector out of a matrix by stacking its columns on top of each other:

$$\text{col } B = \begin{bmatrix} B^1 \\ B^2 \\ \vdots \\ B^r \end{bmatrix} \qquad (rp \times 1 \text{ vector})$$

where $B^j$ is the $j$th column of $B$.
Consider (4.56) to (4.59). Show that (4.58) can be transformed into (4.59) with

$$\theta = \text{col } \Theta^T$$
$$\boldsymbol{\varphi}(t) = \varphi(t) \otimes I_p$$

where $I_p$ is the $p \times p$ unit matrix. Are other variants of $\theta$ and $\boldsymbol{\varphi}$ also possible?


4G.7 Consider the continuous-time state-space model (4.96) to (4.97). Assume that the
measurements are made in wideband noise with high variance, idealized as

$$\bar{y}(t) = Hx(t) + \bar{v}(t)$$

where $\bar{v}(t)$ is formal continuous-time white noise with covariance function

$$E\bar{v}(t)\bar{v}^T(s) = \bar{R}_2(\theta)\delta(t - s)$$

Assume that $\bar{v}(t)$ is independent of $w(t)$. Let the output be defined as

$$y((k+1)T) = \frac{1}{T}\int_{t=kT}^{(k+1)T} \bar{y}(t)\,dt$$

Show that the sampled-data system can be represented as (4.98) and (4.99) but with

$$y(kT) = C_T(\theta)x(kT) + D_T(\theta)u(kT - T) + v_T(kT)$$

$$C_T(\theta) = \frac{1}{T}H\Phi_T(\theta)$$

$$Ew_T(kT)v_T^T(kT) = R_{12}(\theta) = \frac{1}{T}\int_0^T e^{F(\theta)\tau}\bar{R}_1(\theta)\Phi_\tau^T(\theta)H^T\,d\tau$$

$$Ev_T(kT)v_T^T(kT) = R_2(\theta) = \frac{1}{T}\bar{R}_2(\theta) + \frac{1}{T^2}\int_0^T H\Phi_\tau(\theta)\bar{R}_1(\theta)\Phi_\tau^T(\theta)H^T\,d\tau$$

$$\Phi_\tau(\theta) = \int_0^\tau e^{F(\theta)s}\,ds; \qquad D_T(\theta) = \frac{1}{T}\int_0^T H\Phi_\tau(\theta)G(\theta)\,d\tau$$


4G.8 Consider the ARX model (4.7). Introduce the $\delta$-operator

$$\delta = 1 - q^{-1}$$

and reparametrize the models in terms of coefficients of powers of $\delta$. Work out the
details of a second-order example. Such a parametrization has the advantage of being
less sensitive to numerical errors when the sampling interval is short; see Middleton and
Goodwin (1990).


4E.1 Consider the ARX model structure

$$y(t) + a_1 y(t-1) + \cdots + a_{n_a} y(t - n_a) = b_1 u(t-1) + \cdots + b_{n_b} u(t - n_b) + e(t)$$

where $b_1$ is known to be 0.5. Write the corresponding predictor in the linear regression
form (4.13).


4E.2 Consider the continuous-time model (4.75) of the dc servo with $T_\ell(t) \equiv 0$. Apply the
Euler approximation (2.25) to obtain an approximate discrete-time transfer function
that is a simpler function of $\theta$.


4E.3 Consider the small network of tanks in Figure 4.8. Each tank holds 10 volume units of
fluid. Through the pipes A and E flows 1 volume unit per second, through the pipe B,
$\theta$ units, and through C and D, $1 - \theta$ units per second. The concentration of a certain
substance in the fluid is $u$ in pipe A (the input) and $y$ in pipe E (the output). Write
down a structured state-space model for this system. Assume that each tank is perfectly
mixed (i.e., the substance has the same concentration throughout the tank). (Models of
this character are known as compartmental models and are very common in chemical
and biological applications; see Godfrey, 1983.)


Figure 4.8 A network of tanks.


4E.4 Consider the RLC circuit in Figure 4.9 with ideal voltage source $u_v(t)$ and ideal current
source $u_i(t)$. View this circuit as a linear time-invariant system with two inputs

$$u(t) = \begin{bmatrix} u_v(t) \\ u_i(t) \end{bmatrix}$$

and one output: the voltage $y(t)$. $R$, $L$, and $C$ are unknown constants. Discuss several
model set parametrizations that could be feasible for this system and describe their
advantages and disadvantages.


Figure 4.9 A simple circuit.


Hint: The basic equations for this circuit are

$$u_v(t) = L\frac{di(t)}{dt} + y(t) + R[i(t) + u_i(t)]$$
$$y(t) = \frac{1}{C}\int_0^t i(\tau)\,d\tau$$
4E.5 A state-space model of ship-steering dynamics can be given as follows:

$$\frac{d}{dt}\begin{bmatrix} v(t) \\ r(t) \\ h(t) \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & 0 \\ a_{21} & a_{22} & 0 \\ 0 & 1 & 0 \end{bmatrix}\begin{bmatrix} v(t) \\ r(t) \\ h(t) \end{bmatrix} + \begin{bmatrix} b_{11} \\ b_{21} \\ 0 \end{bmatrix}u(t)$$

where $u(t)$ is the rudder angle, $v(t)$ the sway velocity, $r(t)$ the turning rate, and $h(t)$
the heading angle.


(a) Suppose only $u(t)$ and $y(t) = h(t)$ are measured. Show that the six parameters
$a_{ij}$, $b_{ij}$ are not identifiable.


(b) Try also to show that if $u(t)$ and $y(t) = \begin{bmatrix} v(t) \\ h(t) \end{bmatrix}$ are measured then all six
parameters are globally identifiable at values such that the model is controllable.
If you cannot complete the calculations, indicate how you would approach the
problem (reference: Godfrey and DiStefano, 1985).

4E.6 Consider the model structure (4.91) with 


A(@) = 


| 
re | 
I l 
à A 
J — 
© =. 
a | 
w 
~~ 
D 
Il 
a | 
> 
tw — 
a | 


C(@) 


lI 
ma 
— 

o 
b_i 
nx 
=—_ 

D 
Ne 
I 
= 
tae 

19 — 
eect 
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and another structure 


A, 0 iy 
A = 3 B = 
(7) l i a (n) bal 


Cm =([n x]. K(n) = | 


os i T l 
n= [h h ma kenra ke]. neD cR 
Determine D, and D- so that the two model structures determine the same model set. 
What about identifiability properties? 


4E.7 Consider the heated metal rod of Example 4.3. Introduce a five-state lumped approxi- 
mation and write down the state-space model explicitly. 


4E.8 Consider the OE model structure with $n_b = 2$, $n_f = 1$, and $b_1$ fixed to unity:

$$y(t) = \frac{q^{-1} + b_2 q^{-2}}{1 + f_1 q^{-1}}u(t) + e(t), \qquad \theta = \begin{bmatrix} b_2 & f_1 \end{bmatrix}^T$$

Determine $\Gamma_1(\theta)$ of Problem 4G.4 explicitly. When is it singular?


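4E.9 Consider the model structures

$$\mathcal{M}_1: \quad y(t) = -a\,y(t-1) + b\,u(t-1)$$
$$\theta = \begin{bmatrix} a \\ b \end{bmatrix}, \qquad D_{\mathcal{M}_1} = \{|a| < 1,\ b > 0\}$$

and

$$\mathcal{M}_2: \quad y(t) = -(\cos\alpha)\,y(t-1) + e^{\beta}u(t-1)$$
$$\eta = \begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \qquad D_{\mathcal{M}_2} = \{0 < \alpha < \pi,\ -\infty < \beta < \infty\}$$

Show that $\mathcal{R}(\mathcal{M}_1) = \mathcal{R}(\mathcal{M}_2)$. Discuss possible advantages and disadvantages with the
two structures.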


4E.10 Consider the dc-motor model (4.75). Assume that the torque $T_\ell$ can be seen as a white-noise
zero mean disturbance with variance $\sigma^2$ (i.e., the variations in $T_\ell$ are random
and fast compared to the dynamics of the motor). Apply (4.97) to (4.99) to determine
$R_1(\theta)$ and $R_{12}(\theta)$ in a sampled model (4.84) and (4.85) of the motor, with $A(\theta)$ and
$B(\theta)$ given by (4.77) and


6@= |B]. y=y -0 
Y 


As an alternative, we could use a directly parametrized innovations form (4.91) with
$A(\theta)$ and $B(\theta)$ again given by (4.77), but

$$K(\theta) = \begin{bmatrix} k_1 \\ k_2 \end{bmatrix} \quad \text{and} \quad \theta = \begin{bmatrix} \tau & \beta & k_1 & k_2 \end{bmatrix}^T$$

Discuss the advantages and disadvantages of these two parametrizations.


4E.11 Consider the system description

$$x(t+1) = ax(t) + bu(t) + \xi(t)$$
$$y(t) = x(t) + e(t)$$

where $e(t)$ is white Gaussian noise and $\xi(t)$ has the distribution

$$\xi(t) = 0, \quad \text{w.p. } 1 - \lambda$$
$$\xi(t) = +1, \quad \text{w.p. } \lambda/2$$
$$\xi(t) = -1, \quad \text{w.p. } \lambda/2$$

The coefficients $a$, $b$, and $\lambda$ are adjustable parameters. Can this description be cast into
the form (4.4)? If so, at the expense of what approximations?


4E.12 Consider a multivariable ARX model set

$$y(t) + A_1 y(t-1) + A_2 y(t-2) = B_1 u(t-1) + e(t)$$

where $\dim y = p = 2$, $\dim u = m = 1$, and where the matrices are parametrized as

$$A_1 = \begin{bmatrix} \times & \times \\ \alpha & \times \end{bmatrix}, \qquad A_2 = \begin{bmatrix} \times & \times \\ \beta & \times \end{bmatrix}, \qquad B_1 = \begin{bmatrix} \times \\ \times \end{bmatrix}$$

where $\alpha$ and $\beta$ are known values and $\times$ indicates a parameter to be estimated. Write
the predictor in the form

$$\hat{y}(t|\theta) = \varphi^T(t)\theta + \mu(t)$$

with $\mu(t)$ as a known term and give explicit expressions for $\varphi$ and $\theta$. Can this predictor
be written in the form (4.58)?


4T.1 Determine the $k$-step-ahead predictor for the ARMAX model (4.15).


4T.2 Give an expression for the $k$-step-ahead predictor for (4.91).


4T.3 Suppose that $W_u(q)$ and $W_y(q)$ are given functions, known to be determined as
$k$-step-ahead predictors for the system description

$$y(t) = G(q)u(t) + H(q)e(t)$$

Can $G(e^{i\omega})$ and $H(e^{i\omega})$ be uniquely computed from $W_u(e^{i\omega})$ and $W_y(e^{i\omega})$? What if
$G$ and $H$ are known to be of the ARMAX structure

$$G(q) = \frac{B(q)}{A(q)}, \qquad H(q) = \frac{C(q)}{A(q)}$$

where $A$, $B$, and $C$ have known (and suitable) orders?


4D.1 Prove Lemma 4.2.


APPENDIX 4A: IDENTIFIABILITY OF BLACK-BOX MULTIVARIABLE
MODEL STRUCTURES


The topic of multivariable model structures and canonical forms for multivariable
systems is often regarded as difficult, and there is an extensive literature in the field.
We shall here give a no-frills account of the problem, and the reader is referred to
the literature for more insights and deeper results. See the bibliography.

The issue still is whether (4.136) holds at a given $\theta^*$. Our development parallels
the one in Section 4.6. We start by discussing polynomial parametrizations or MFDs,
such as (4.55) to (4.61), and then turn to state-space models. Throughout the section,
$p$ denotes the number of outputs and $m$ the number of inputs.


Matrix Fraction Descriptions (MFD) 


Consider first the simple multivariable ARX structure (4.52) or (4.56). This uses

$$G(z,\theta) = A^{-1}(z)B(z), \qquad H(z,\theta) = A^{-1}(z) \tag{4A.1}$$

with $\theta$ comprising all the coefficients of the matrix polynomials (in $1/z$) $A(z)$ and
$B(z)$. These could be of arbitrary orders. Just as for the SISO case (4.139), it is
immediate to verify that (4.136) holds for all $\theta^*$. Hence the model structure given
by the MFD (4A.1) is strictly globally identifiable.


Let us now turn to the output error model structure

$$G(z,\theta) = F^{-1}(z)B(z), \qquad H(z,\theta) = I \tag{4A.2}$$

It should be noted that the analysis of (4A.2) also contains the analysis of the
multivariable ARMAX structure and multivariable Box-Jenkins models. See the corollary
to Theorem 4A.1, which follows.


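The matrix polynomial $F(z)$ is here a $p \times p$ matrix

$$F(z) = \begin{bmatrix} F_{11}(z) & F_{12}(z) & \cdots & F_{1p}(z) \\ F_{21}(z) & F_{22}(z) & \cdots & F_{2p}(z) \\ \vdots & \vdots & & \vdots \\ F_{p1}(z) & F_{p2}(z) & \cdots & F_{pp}(z) \end{bmatrix} = F^{(0)} + F^{(1)}z^{-1} + \cdots + F^{(\nu)}z^{-\nu} \tag{4A.3}$$

whose entries are polynomials in $z^{-1}$:

$$F_{ij}(z) = f_{ij}^{(0)} + f_{ij}^{(1)}z^{-1} + \cdots + f_{ij}^{(\nu_{ij})}z^{-\nu_{ij}} \tag{4A.4}$$

The degree of the $F_{ij}$ polynomial will thus be denoted by $\nu_{ij}$, and $\nu = \max_{i,j} \nu_{ij}$.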

Similarly, $B(z)$ is a $p \times m$ matrix polynomial. Let the degrees of its entries be
denoted by $\mu_{ij}$.

The structure issue is really to select the orders $\nu_{ij}$ and $\mu_{ij}$ [i.e., $p(p + m)$
integers]. This will give a staggering number of possible model structures. Some
special cases discussed in the literature are

1. $\nu_{ij} = n$, $\mu_{ij} = r$ (4A.5)
2. $\nu_{ij} = 0$, $i \neq j$; $\nu_{ii} = n_i$, $\mu_{ij} = r_i$ (4A.6)
3. $\nu_{ij} = n_j$, all $i$; $\mu_{ij} = r_j$, all $i$ (4A.7)

In all these cases we fix the leading matrix to be a unit matrix:

$$F^{(0)} = I \tag{4A.8}$$
The form (4A.5) is called the "full polynomial form" in Söderström and Stoica (1983).
It clearly is a special case of (4A.7). It is used and discussed in Hannan (1969, 1976),
Kashyap and Rao (1976), Jakeman and Young (1979), and elsewhere.

The form (4A.6) gives a diagonal $F$-matrix and has been used, for example, in
Kashyap and Nasburg (1974), Sinha and Caines (1977), and Gauthier and Landau
(1978).

The structure (4A.7), where the different columns are given different orders, is
discussed, for example, in Guidorzi (1975), Gauthier and Landau (1978), and Gevers
and Wertz (1984).


Remark. In the literature, especially the one discussing canonical forms rather
than identification applications, often the polynomials

$$\tilde{F}(z) = z^{\nu}F(z) = F^{(0)}z^{\nu} + F^{(1)}z^{\nu-1} + \cdots + F^{(\nu)} \tag{4A.9}$$

in the variable $z$ are considered instead of $F(z)$ (just as we did in the SISO case).
Canonical representations of $\tilde{F}(z)$ [such as the "Hermite form"; see Dickinson,
Kailath, and Morf, 1974; Hannan, 1971a; or Kailath, 1980] will then typically involve
singular matrices $F^{(0)}$. Such representations are not suitable for our purposes since
$y(t)$ cannot be solved for explicitly in terms of past data.


The identifiability properties of the diagonal form (4A.6) can be analyzed by
SISO arguments. For the others we need some theory for matrix polynomials.


Some Terminology for Matrix Polynomials


Kailath (1980), Chapter 6, gives a detailed account of various concepts and properties
of matrix polynomials. We shall here need just a few:

A $p \times p$ matrix polynomial $P(x)$ is said to be unimodular if $\det P(x)$ is a nonzero
constant. Then $P^{-1}(x)$ is also a matrix polynomial. Two polynomials $P(x)$ and
$Q(x)$ with the same number of rows have a common left divisor if there exists a
matrix polynomial $L(x)$ such that

$$P(x) = L(x)\tilde{P}(x), \qquad Q(x) = L(x)\tilde{Q}(x)$$

for some matrix polynomials $\tilde{P}(x)$ and $\tilde{Q}(x)$.

$P(x)$ and $Q(x)$ are said to be left coprime if all common left divisors are
unimodular. This is a direct extension of the corresponding concept for scalar
polynomials. A basic theorem says that if $P(x)$ and $Q(x)$ are left coprime then there
exist matrix polynomials $A(x)$ and $B(x)$ such that

$$P(x)A(x) + Q(x)B(x) = I \quad \text{(identity matrix)} \tag{4A.10}$$


Loss of Identifiability in Multivariable MFD Structures


We can now state the basic identifiability result. 


Theorem 4A.1. Consider the output error MFD model structure (4A.2) with the
polynomial degrees chosen according to the scheme (4A.7). Let $\theta$ comprise all
the coefficients in the resulting matrix polynomials, and let $F_*(z)$ and $B_*(z)$ be the
polynomials in $1/z$ that correspond to the value $\theta^*$. Let

$$D_p(z) = \operatorname{diag}(z^{n_1}, \ldots, z^{n_p}), \qquad D_m(z) = \operatorname{diag}(z^{r_1}, \ldots, z^{r_m})$$

be diagonal matrices, with $n_i$ and $r_i$ defined in (4A.7), and define $\tilde{F}_*(z) = F_*(z)D_p(z)$,
$\tilde{B}_*(z) = B_*(z)D_m(z)$ as polynomials in $z$. Then the model structure in question is
globally and locally identifiable at $\theta^*$ if and only if

$$\tilde{F}_*(z) \text{ and } \tilde{B}_*(z) \text{ are left coprime} \tag{4A.11}$$
Proof. Let $\theta$ correspond to $F(z)$ and $B(z)$, and assume that

$$G(z,\theta) = F^{-1}(z)B(z) = F_*^{-1}(z)B_*(z) = G(z,\theta^*)$$

This can also be written as

$$D_p(z)\tilde{F}^{-1}(z)\tilde{B}(z)D_m^{-1}(z) = D_p(z)\tilde{F}_*^{-1}(z)\tilde{B}_*(z)D_m^{-1}(z)$$

where $\tilde{F}$ and $\tilde{B}$ are defined analogously to $\tilde{F}_*$ and $\tilde{B}_*$. This gives

$$\tilde{B}_*(z) = \tilde{F}_*(z)\tilde{F}^{-1}(z)\tilde{B}(z) \tag{4A.12}$$

When $\tilde{B}_*$ and $\tilde{F}_*$ are left coprime there exist, according to (4A.10), matrix
polynomials $X(z)$ and $Y(z)$ such that

$$\tilde{F}_*(z)X(z) + \tilde{B}_*(z)Y(z) = I$$

Inserting (4A.12) into this expression gives

$$\tilde{F}_*(z)\tilde{F}^{-1}(z)\left[\tilde{F}(z)X(z) + \tilde{B}(z)Y(z)\right] = I$$

or

$$\tilde{F}(z)X(z) + \tilde{B}(z)Y(z) = \tilde{F}(z)\tilde{F}_*^{-1}(z) = U(z)$$

Since the left side is a matrix polynomial in $z$, so is $U(z)$. We have

$$\tilde{F}(z) = U(z)\tilde{F}_*(z) \tag{4A.13}$$

Note that, by (4A.8),

$$I = \lim_{z\to\infty} F(z) = \lim_{z\to\infty} \tilde{F}(z)D_p^{-1}(z) = \lim_{z\to\infty} \tilde{F}_*(z)D_p^{-1}(z)$$

Hence, multiplying (4A.13) by $D_p^{-1}(z)$ gives

$$I = \lim_{z\to\infty} U(z)$$

which, since $U(z)$ is a polynomial in $z$, shows that $U(z) = I$, and hence $\tilde{F}(z) = \tilde{F}_*(z)$,
which in turn implies that $\tilde{B}(z) = \tilde{B}_*(z)$, and the if-part of the theorem has
been proved. If (4A.11) does not hold, a common, nonunimodular, left factor $U_*(z)$
can be pulled out from $\tilde{F}_*(z)$ and $\tilde{B}_*(z)$ and be replaced by an arbitrary matrix
with the same orders as $U_*(z)$ [subject to the constraint (4A.8)]. This proves the
only-if-part of the theorem. □
The theorem can immediately be extended to a model structure

$$G(z,\theta) = F^{-1}(z)B(z), \qquad H(z,\theta) = D^{-1}(z)C(z) \tag{4A.14}$$

with $F$ and $D$ subject to the degree structure (4A.7). It can also be extended to the
multivariable ARMAX structure:

$$G(z,\theta) = A^{-1}(z)B(z), \qquad H(z,\theta) = A^{-1}(z)C(z) \tag{4A.15}$$

Corollary 4A.1. Consider the ARMAX model structure (4A.15) with the degrees
of the polynomial $A(z)$ subject to (4A.7). Let $\tilde{A}_*(z)$ and $\tilde{\mathcal{B}}_*(z) = [\tilde{B}_*(z) \ \ \tilde{C}_*(z)]$,
a $p \times (m + p)$ matrix polynomial, be the polynomials that correspond to $\theta^*$, as
described in the theorem. Then the structure is identifiable at $\theta^*$ if and only if

$$\tilde{A}_*(z) \text{ and } \tilde{\mathcal{B}}_*(z) \text{ are left coprime}$$

The usefulness of these identifiability results lies in the fact that only $p$ orders
(the column degrees) have to be chosen with care to find a suitable identifiable
structure, despite the fact that $p \cdot m$ [or even $p \cdot (m + p)$ in the ARMAX case]
different transfer functions are involved.


State-space Model Structures 


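For a multivariable state-space model (4.146), we introduce a parametric structure,
analogous to (4.146):

$$A(\theta) = \begin{bmatrix}
0&1&0&0&0&0&0&0&0 \\
0&0&1&0&0&0&0&0&0 \\
\times&\times&\times&\times&\times&\times&\times&\times&\times \\
0&0&0&0&1&0&0&0&0 \\
\times&\times&\times&\times&\times&\times&\times&\times&\times \\
0&0&0&0&0&0&1&0&0 \\
0&0&0&0&0&0&0&1&0 \\
0&0&0&0&0&0&0&0&1 \\
\times&\times&\times&\times&\times&\times&\times&\times&\times
\end{bmatrix}, \qquad
B(\theta) = \begin{bmatrix}
\times&\times \\ \times&\times \\ \times&\times \\ \times&\times \\ \times&\times \\ \times&\times \\ \times&\times \\ \times&\times \\ \times&\times
\end{bmatrix}$$

$$K(\theta) = \begin{bmatrix}
\times&\times&\times \\ \times&\times&\times \\ \times&\times&\times \\ \times&\times&\times \\ \times&\times&\times \\ \times&\times&\times \\ \times&\times&\times \\ \times&\times&\times \\ \times&\times&\times
\end{bmatrix}, \qquad
C(\theta) = \begin{bmatrix}
1&0&0&0&0&0&0&0&0 \\
0&0&0&1&0&0&0&0&0 \\
0&0&0&0&0&1&0&0&0
\end{bmatrix} \tag{4A.16}$$

The number of rows with $\times$'s in $A(\theta)$ equals the number of outputs. We have
thus illustrated the structure for $n = 9$, $p = 3$, $m = 2$. In words, the general
structure can be defined as: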


Let $A(\theta)$ initially be a matrix filled with zeros and with ones along the
superdiagonal. Let then row numbers $r_1, r_2, \ldots, r_p$, where $r_p = n$, be
filled with parameters. Take $r_0 = 0$ and let $C(\theta)$ be filled with zeros, and
then let row $i$ have a one in column $r_{i-1} + 1$. Let $B(\theta)$ and $K(\theta)$ be filled
with parameters. (4A.17)


The parametrization is uniquely characterized by the $p$ numbers $r_i$ that are to be
chosen by the user. We shall also use

$$\nu_i = r_i - r_{i-1}$$

and call

$$\bar{\nu}_n = \{\nu_1, \ldots, \nu_p\} \tag{4A.18}$$

the multiindex associated with (4A.17). Clearly,

$$n = \sum_{i=1}^{p} \nu_i \tag{4A.19}$$

By a multiindex $\bar{\nu}_n$ we henceforth understand a collection of $p$ numbers $\nu_i \geq 1$
subject to (4A.19). For given $n$ and $p$, there exist $\binom{n-1}{p-1}$ different multiindexes.
Notice that the structure (4A.17) contains $2np + mn$ parameters regardless of $\bar{\nu}_n$.

The key property of a "canonical" parametrization like (4A.16) is that the
corresponding state vector $x(t,\theta)$ can be interpreted in a pure input-output context.
This can be seen as follows. Fix time $t$, and assume that $u(s) = e(s) = 0$ for
$s \geq t$. Denote the corresponding outputs that are generated by the model for $s \geq t$
by $\hat{y}_\theta(s|t-1)$. We could think of them as projected outputs for future times $s$ as
calculated at time $t - 1$. The state-space equations give directly

$$\begin{aligned}
\hat{y}_\theta(t|t-1) &= C(\theta)x(t,\theta) \\
\hat{y}_\theta(t+1|t-1) &= C(\theta)A(\theta)x(t,\theta) \\
&\ \,\vdots \\
\hat{y}_\theta(t+n-1|t-1) &= C(\theta)A^{n-1}(\theta)x(t,\theta)
\end{aligned} \tag{4A.20}$$

With

$$O_n(\theta) = \begin{bmatrix} C(\theta) \\ C(\theta)A(\theta) \\ \vdots \\ C(\theta)A^{n-1}(\theta) \end{bmatrix} \tag{4A.21}$$

(the $np \times n$ observability matrix) and

$$\hat{Y}_n(t) = \begin{bmatrix} \hat{y}_\theta(t|t-1) \\ \vdots \\ \hat{y}_\theta(t+n-1|t-1) \end{bmatrix}$$

we can write (4A.20) as

$$\hat{Y}_n(t) = O_n(\theta)x(t,\theta) \tag{4A.22}$$


It is straightforward to verify that (4A.17) has a fundamental property: The $np \times n$
observability matrix $O_n(\theta)$ will have $n$ rows that together constitute the unit matrix,
regardless of $\theta$. The reader is invited to verify that row number $kp + i$ of $O_n$ will be

$$[\,0 \ \ 0 \ \ldots \ 0 \ \ 1 \ \ 0 \ \ldots \ 0\,]$$

with 1 in position $r_{i-1} + k + 1$. This holds for $1 \leq i \leq p$, $0 \leq k < \nu_i$. Thus (4A.22)
implies that the state variables corresponding to the structure (4A.17) are

$$x_{r_{i-1}+k+1}(t,\theta) = \hat{y}_\theta^{(i)}(t+k|t-1), \qquad i = 1, \ldots, p, \quad 0 \leq k < \nu_i \tag{4A.23}$$

Here superscript $(i)$ denotes the $i$th component of $\hat{y}$. This interpretation of state
variables as predictors is discussed in detail in Akaike (1974b) and Rissanen (1974).
By the relation (4A.23), $n$ rows are picked out from the $np$ vector $\hat{Y}_n(t)$ in (4A.22).
The indexes of these rows are uniquely determined by the multiindex $\bar{\nu}_n$. Let them
be denoted by

$$I_{\bar{\nu}_n} = \{(k-1)p + i;\ 1 \leq k \leq \nu_i;\ 1 \leq i \leq p\} \tag{4A.24}$$

The key relationship is (4A.23). It shows that the state variables depend only on the
input-output properties of the associated model.
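
As a numerical check of the unit-row property, the following Python sketch builds the structure (4A.17) for the multiindex of the illustration above (n = 9, p = 3, m = 2, with the indexes 3, 2, 4), fills the free entries with arbitrary random numbers, and picks out the rows (4A.24) of the observability matrix. The use of NumPy and the helper names `canonical_structure` and `observability_matrix` are assumptions made for this illustration only.

```python
import numpy as np

def canonical_structure(nu, m, rng):
    """Build A, B, K, C of the form (4A.17) for the multiindex nu = (nu_1, ..., nu_p)."""
    p, n = len(nu), sum(nu)
    r = np.cumsum(nu)                          # r_1, ..., r_p with r_p = n
    A = np.diag(np.ones(n - 1), k=1)           # zeros, with ones on the superdiagonal
    A[r - 1, :] = rng.standard_normal((p, n))  # rows r_i are filled with parameters
    B = rng.standard_normal((n, m))            # all entries of B are parameters
    K = rng.standard_normal((n, p))            # all entries of K are parameters
    C = np.zeros((p, n))
    r_prev = np.concatenate(([0], r[:-1]))     # r_0, ..., r_{p-1}
    C[np.arange(p), r_prev] = 1.0              # row i has a one in column r_{i-1} + 1
    return A, B, K, C

def observability_matrix(A, C):
    """Stack C, CA, ..., CA^(n-1) into the np x n observability matrix."""
    blocks = [C]
    for _ in range(A.shape[0] - 1):
        blocks.append(blocks[-1] @ A)
    return np.vstack(blocks)

nu = (3, 2, 4)
rng = np.random.default_rng(0)
A, B, K, C = canonical_structure(nu, m=2, rng=rng)
O = observability_matrix(A, C)

p = len(nu)
# Zero-based versions of the row indexes (4A.24): block k, output i, for 0 <= k < nu_i
rows = [k * p + i for i in range(p) for k in range(nu[i])]
unit_rows_ok = np.allclose(O[rows, :], np.eye(sum(nu)))
print("selected rows form the unit matrix:", unit_rows_ok)
```

Changing the random seed changes every free parameter but should leave the selected rows equal to the unit matrix, which is the content of the statement that the property holds regardless of the parameter value.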

Consider now two values $\theta^*$ and $\theta$ that give the same input-output properties
of (4A.17). Then $\hat{y}_\theta(t+k|t-1) = \hat{y}_{\theta^*}(t+k|t-1)$, since these are computed from
input-output properties only. Thus $x(t,\theta) = x(t,\theta^*)$. Now, if $\theta^*$ corresponds to a
minimal realization, so must $\theta$, and Theorem 6.2.4 of Kailath (1980) gives that there
exists an invertible matrix $T$ such that

$$\begin{aligned} A(\theta^*) &= TA(\theta)T^{-1}, & B(\theta^*) &= TB(\theta) \\ K(\theta^*) &= TK(\theta), & C(\theta^*) &= C(\theta)T^{-1} \end{aligned} \tag{4A.25}$$

corresponding to the change of basis

$$x(t,\theta^*) = Tx(t,\theta) \tag{4A.26}$$

But (4A.26) together with our earlier observation that $x(t,\theta^*) = x(t,\theta)$ shows that
$T = I$, and hence that $\theta^* = \theta$.


We have now proved the major part of the following theorem: 


Theorem 4A.2. Consider the state-space model structure (4A.17). This structure
is globally and locally identifiable at $\theta^*$ if and only if $\{A(\theta^*), [B(\theta^*) \ \ K(\theta^*)]\}$ is
controllable.


Proof. The if-part was proved previously. To show the only-if-part, we find that if $\theta^*$
does not give a controllable system, then its input-output properties can be described
by a lower-dimensional model with an additional, arbitrary, noncontrollable mode.
This can be accomplished by infinitely many different $\theta$'s. □


It follows from the theorem that the parametrization (4A.17) is globally identifiable,
and as such is a good candidate to describe systems of order $n$. What is
not clear yet is whether any $n$th-order linear system can be represented in the form
(4A.17) for an arbitrary choice of multiindex $\bar{\nu}_n$. That is the question we now turn
to.


Hankel-Matrix Interpretation


Consider a multivariable system description

$$y(t) = G_0(q)u(t) + H_0(q)e(t) = T_0(q)\chi(t) \tag{4A.27}$$

with

$$T_0(q) = \begin{bmatrix} G_0(q) & H_0(q) \end{bmatrix}, \qquad \chi(t) = \begin{bmatrix} u(t) \\ e(t) \end{bmatrix}$$

Assume that $T_0(q)$ has full row rank [i.e., $LT_0(q)$ is not identically zero for any
nonzero $1 \times p$ vector $L$]. Let

$$T_0(q) = \begin{bmatrix} 0 & I \end{bmatrix} + \sum_{k=1}^{\infty} H_k q^{-k} \tag{4A.28}$$

be the impulse response of the system. The matrices $H_k$ are here $p \times (p + m)$.
Define the matrix

$$\mathcal{H}_{r,s} = \begin{bmatrix} H_1 & H_2 & \cdots & H_s \\ H_2 & H_3 & \cdots & H_{s+1} \\ H_3 & H_4 & \cdots & H_{s+2} \\ \vdots & \vdots & & \vdots \\ H_r & H_{r+1} & \cdots & H_{r+s-1} \end{bmatrix} \tag{4A.29}$$

This structure with the same block elements along antidiagonals is known as a block
Hankel matrix. Consider the semi-infinite matrix $\mathcal{H}_r = \mathcal{H}_{r,\infty}$. For this matrix we have
the following two fundamental results.


Lemma 4A.1. Suppose that the $n$ rows $I_{\bar{\nu}_n}$ [see (4A.24)] of $\mathcal{H}_n$ span all the rows of
$\mathcal{H}_{n+1}$. Then the system (4A.27) can be represented in the state-space form (4A.17)
corresponding to the multiindex $\bar{\nu}_n$.


The proof consists of an explicit construction and is given at the end of this appendix.


Lemma 4A.2. Suppose that

$$\operatorname{rank} \mathcal{H}_{n+1} = n \tag{4A.30}$$

Then there exists a multiindex $\bar{\nu}_n$ such that the $n$ rows $I_{\bar{\nu}_n}$ span $\mathcal{H}_{n+1}$.


The proof of this lemma is also given at the end of this appendix.


It follows from the two lemmas that (4A.30) is a sufficient condition for (4A.27)
to be an $n$-dimensional linear system (i.e., to admit a state-space representation of
order $n$). It is, however, well known that this is also a necessary condition. ($\mathcal{H}$ is
obtained as the product of the observability and controllability matrices.) We thus
conclude:


Any linear system that can be represented in state-space form of order $n$
can also be represented in the particular form (4A.17) for some multiindex
$\bar{\nu}_n$. (4A.31)


When (4A.30) holds, we thus find that the $np$ rows of $\mathcal{H}_n$ span an $n$-dimensional (or
less) linear space. The generic situation is then that the same space is spanned by any
subset of $n$ rows of $\mathcal{H}_n$. (By this term we mean that if we randomly pick, from a uniform
distribution, $np$ row vectors to span an $n$-dimensional space, the probability is 1 that
any subset of $n$ vectors will span the same space.) We thus conclude:


A state-space representation in the form (4A.17) for a particular multiindex
$\bar{\nu}_n$ is capable of describing almost all $n$-dimensional linear systems. (4A.32)


Overlapping Parametrizations 


Let $\mathcal{M}_{\bar{\nu}_n}$ denote the model structure (4A.17) corresponding to $\bar{\nu}_n$. The result (4A.31)
then implies that the model set

$$\mathcal{M} = \bigcup_{\bar{\nu}_n} \mathcal{R}(\mathcal{M}_{\bar{\nu}_n}) \tag{4A.33}$$

(union over all possible multiindices $\bar{\nu}_n$) covers all linear $n$-dimensional systems.
We have thus been able to describe the set of all linear $n$-dimensional systems as the
union of ranges of identifiable structures [cf. (4.129)]. From (4A.32), it follows that
the ranges of $\mathcal{M}_{\bar{\nu}_n}$ overlap considerably. This is no disadvantage for identification;
on the contrary, one may then change from one structure to another without losing
information. The practical use of such overlapping parametrizations for identification
is discussed in van Overbeek and Ljung (1982). Using a topological argument,
Delchamps and Byrnes (1982) give estimates on the number of overlapping structures
needed in (4A.33). See also Hannan and Kavalieris (1984).


Connections Between Matrix Fraction and State-Space 
Descriptions 


In the SISO case the connection between a state-space model in observability form
and the corresponding ARMAX model is simple and explicit (see Example 4.2).
Unfortunately, the situation is much more complex in the multivariable case. We refer
to Gevers and Wertz (1984), Guidorzi (1981), and Beghelli and Guidorzi (1983) for
detailed discussions.

We may note, though, the close connection between the indexes $\nu_i$ used in
(4A.17) and the column degrees $n_i$ in (4A.7). Both determine the number of time
shifts of the $i$th component of $y$ that are explicitly present in the representations.
The shifts are, however, forward for the state space and backward for the MFD. The
relationship between the $\nu_i$ and the observability indexes is sorted out in the proof
of Lemma 4A.2.

A practical difference between the two representations is that the state-space
representation naturally employs the state $x(t)$ ($n$ variables) as a memory vector
for simulation and other purposes. When (4A.2) is simulated in a straightforward
fashion, the different delayed components of $y$ and $u$ are stored, a total number of
np +m-)_r; variables. This is of course not necessary, but an efficient organiza- 
tion of the variables to be stored amounts to a state-space representation. There 
are consequently several advantages associated with state-space representations for 
multivariable systems. 


Proofs of Lemmas 4A.1 and 4A.2


It now remains only to prove Lemmas 4A.1 and 4A.2.

Proof of Lemma 4A.1. Let

$$S(t) = \begin{bmatrix} \chi(t-1) \\ \chi(t-2) \\ \vdots \end{bmatrix}$$

Let [cf. (4A.20) to (4A.22)]

$$\hat{y}_0(t|t-k) = \sum_{\ell=k}^{\infty} H_\ell\, \chi(t-\ell) \tag{4A.34}$$

and

$$\hat{Y}_N(t) = \begin{bmatrix} \hat{y}_0(t|t-1) \\ \vdots \\ \hat{y}_0(t+N-1|t-1) \end{bmatrix}$$

Then, from (4A.28) and (4A.29),

$$\hat{Y}_N(t) = \mathcal{H}_N S(t) \tag{4A.35}$$

Now enumerate the row indexes $i_r$ of $I_{\bar{\nu}_n}$ in (4A.24) as follows:

$$\begin{aligned}
i_1 &= 1, & i_2 &= p + 1, & &\ldots, & i_{r_1} &= (\nu_1 - 1)p + 1 \\
i_{r_1+1} &= 2, & i_{r_1+2} &= p + 2, & &\ldots, & i_{r_2} &= (\nu_2 - 1)p + 2 \\
& & & & &\ \vdots & & \\
i_{r_{p-1}+1} &= p, & i_{r_{p-1}+2} &= p + p, & &\ldots, & i_n &= (\nu_p - 1)p + p
\end{aligned} \tag{4A.36}$$

Recall that

$$r_k = \sum_{j=1}^{k} \nu_j$$

Now construct the $n$-vector $x(t)$ by taking its $r$th component to be the $i_r$th component
of $\hat{Y}_N(t)$. Let us now focus on the components $i_1 + p, i_2 + p, \ldots, i_n + p$ of
(4A.35). Collect these components into a vector $\xi(t+1)$. They all correspond to
rows of $\mathcal{H}_{n+1}$. But the rows of this matrix are spanned by the rows that define $x(t)$,
by the assumption of the lemma. Hence

$$\xi(t+1) = Fx(t) \tag{4A.37}$$

for some matrix $F$. Now several of the components of $\xi(t+1)$ will also belong to $x(t)$,
as shown in (4A.36). The corresponding rows of $F$ will then be zeros everywhere
except for a 1 in one position. A moment's reflection on (4A.36) shows that the
matrix $F$ will in fact have the structure (4A.17). Also, with $H$ given by (4A.17),

$$y(t) = Hx(t) + e(t) \tag{4A.38}$$

Let us now return to (4A.37). Consider component $r$ of $x(t+1)$, which by
definition equals row $i_r$ of $\hat{Y}_N(t+1)$. This row is given as $\hat{y}_0^{(j)}(t+k|t)$ for some
values $j$ and $k$ that depend on $i_r$. But, according to (4A.34), we have

$$\hat{y}_0(t+k|t) = \hat{y}_0(t+k|t-1) + H_k\chi(t) \tag{4A.39}$$

Hence

$$x_r(t+1) = \hat{y}_0^{(j)}(t+k|t) = \hat{y}_0^{(j)}(t+k|t-1) + \left[H_k\chi(t)\right]_j$$

But the first term of the right side equals component number $i_r + p$ of $\hat{Y}_N(t)$ [i.e.,
$\xi_r(t+1)$]. Hence

$$x(t+1) = \xi(t+1) + M\chi(t) \tag{4A.40}$$

for some matrix $M$. Equations (4A.37), (4A.38), and (4A.40) now form a state-space
representation of (4A.27) within the structure (4A.17), and the lemma is proved. □


Proof of Lemma 4A.2. The defining property of the Hankel matrix $\mathcal{H}_N$ in (4A.29)
means that the same matrix is obtained by either deleting the first block column (and
the last block row) or by deleting the first block row. This implies that, if row $i$ of
block row $k$ [i.e., row $(k-1)p + i$] lies in the linear span of all rows above it, then
so must row $i$ of block $k + 1$.

Now suppose that

$$\operatorname{rank} \mathcal{H}_{n+1} = n$$

and let us search the rows from above for a set of linearly independent ones. A row
that is not linearly dependent on the ones above it is thus included in the basis; the
others are rejected. When the search is finished, we have selected $n$ rows from $\mathcal{H}_{n+1}$.
The observation mentioned previously implies that, if row $kp + i$ is included in this
basis for $k \geq 1$, then so is row $(k-1)p + i$. Hence the row indexes will obey the
structure

$$\begin{matrix}
1, & p + 1, & 2p + 1, & \ldots, & (\sigma_1 - 1)p + 1 \\
2, & p + 2, & 2p + 2, & \ldots, & (\sigma_2 - 1)p + 2 \\
\vdots & & & & \\
p, & p + p, & 2p + p, & \ldots, & (\sigma_p - 1)p + p
\end{matrix}$$

for some numbers $\{\sigma_i\}$ that are known as the observability indexes of the system.
Since the total number of selected rows is $n$, we have

$$\sum_{i=1}^{p} \sigma_i = n$$

The rows thus correspond to the multiindex $\bar{\sigma}_n$ as in (4A.24), and the lemma is proved.
Notice that several other multiindexes may give a spanning set of rows; one does not
have to look for the first linearly independent rows. □
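
The row search used in this proof can be tried out numerically. The sketch below is written in Python with NumPy (an assumption of this illustration). It builds a finite section of the block Hankel matrix from the Markov parameters of a randomly generated fourth-order state-space system, scans the rows from above, and counts the selected rows per output channel. The test system, the rank tolerance, and the function names are hypothetical choices, and the finite section with n block columns is assumed to carry the same row dependencies as the semi-infinite matrix for a system of order n.

```python
import numpy as np

def block_hankel(markov, rows, cols):
    """Block Hankel matrix whose block (i, j), counted from zero, is markov[i + j]."""
    return np.block([[markov[i + j] for j in range(cols)] for i in range(rows)])

def first_independent_rows(H, tol=1e-8):
    """Scan rows from the top; keep a row if it is not in the span of the rows kept so far."""
    kept = []
    for idx in range(H.shape[0]):
        candidate = kept + [idx]
        if np.linalg.matrix_rank(H[candidate, :], tol=tol) == len(candidate):
            kept = candidate
    return kept

rng = np.random.default_rng(1)
n, p, m = 4, 2, 1
A = 0.5 * rng.standard_normal((n, n))
BK = rng.standard_normal((n, m + p))      # stands in for the matrix [B K]
C = rng.standard_normal((p, n))

# markov[k] = C A^k [B K], i.e. the impulse response coefficient H_{k+1}
markov = [C @ np.linalg.matrix_power(A, k) @ BK for k in range(2 * n + 1)]
H = block_hankel(markov, rows=n + 1, cols=n)

kept = first_independent_rows(H)                                 # zero-based row indexes
sigma = [sum(1 for idx in kept if idx % p == i) for i in range(p)]
print("selected rows:", kept)
print("rows per output channel:", sigma, "sum:", sum(sigma))
```

For a randomly generated system the selected indexes are expected to be nested as in the proof (if a row of block k + 1 is selected, so is the same row of block k), and the counts per output channel sum to n.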


MODELS FOR TIME-VARYING 
AND NONLINEAR SYSTEMS 


While linear, time-invariant models no doubt form the most common way of
describing a dynamical system, it is also quite often useful or necessary to employ other
descriptions. In this chapter we shall first, in Section 5.1, discuss linear, time-varying
models. In Section 5.2 we deal with models with nonlinearities in the form of static,
nonlinear elements at the input and/or the output. We also describe how to
handle nonlinearities that can be introduced by suitable nonlinear transformations of
the raw measured data. In Section 5.3 we describe nonlinear models in state-space
form. So far the development concerns model structures where the nonlinearities
are brought in based on some physical insight. In Section 5.4 we turn to general
nonlinear models of black-box type, and describe the general features of such models.
Particular instances of these, like artificial neural networks, wavelet models, etc., are
then dealt with in Section 5.5, while fuzzy models are discussed in Section 5.6. Finally,
in Section 5.7 we give a formal account of what we mean by a model in general, thus
complementing the discussion in Section 4.5 on general linear models.


5.1 LINEAR TIME-VARYING MODELS


Weighting Function


In Chapter 2 we defined a linear system as one where a linear combination of inputs
leads to the same linear combination of the corresponding outputs. A general linear
system can then be described by

$$y(t) = \sum_{k=1}^{\infty} g_t(k)u(t-k) + v(t) \tag{5.1}$$

If we write

$$g_t(k) = g(t, t-k)$$

we find that

$$y(t) = \sum_{s=-\infty}^{t-1} g(t,s)u(s) + v(t) \tag{5.2}$$

where $g(t,s)$, $t = s, s+1, \ldots$, is the response at time $t$ to a unit input pulse at time
$s$. The function $g(t,s)$ is also known as the weighting function, since it describes the
weight that the input at time $s$ has in the output at time $t$.

The description (5.1) is quite analogous to the time-invariant model (2.8), except
that the sequence $g_t(k)$ carries the time index $t$. In general, we could introduce
a time-varying transfer function by

$$G_t(q) = \sum_{k=1}^{\infty} g_t(k)q^{-k} \tag{5.3}$$

and repeat most of the discussion in Section 4.2 for time-varying transfer functions.
In practice, though, it is easier to deal with time variation in state-space forms.

Time-Varying State-Space Model 


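Time variation in state-space models (4.91) is simply obtained by letting the matrices
be time varying:

$$\begin{aligned} x(t+1,\theta) &= A_t(\theta)x(t,\theta) + B_t(\theta)u(t) + K_t(\theta)e(t) \\ y(t) &= C_t(\theta)x(t,\theta) + e(t) \end{aligned} \tag{5.4}$$

The predictor corresponding to (4.86) then becomes

$$\begin{aligned} \hat{x}(t+1,\theta) &= \left[A_t(\theta) - K_t(\theta)C_t(\theta)\right]\hat{x}(t,\theta) + B_t(\theta)u(t) + K_t(\theta)y(t) \\ \hat{y}(t|\theta) &= C_t(\theta)\hat{x}(t,\theta) \end{aligned} \tag{5.5}$$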


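Notice that this can be written

$$\hat{y}(t|\theta) = \sum_{k=1}^{\infty} w_t^u(k,\theta)u(t-k) + \sum_{k=1}^{\infty} w_t^y(k,\theta)y(t-k) \tag{5.6}$$

with

$$\begin{aligned}
w_t^u(k,\theta) &= C_t(\theta) \prod_{j=t-k+1}^{t-1} \left[A_j(\theta) - K_j(\theta)C_j(\theta)\right] B_{t-k}(\theta) \\
w_t^y(k,\theta) &= C_t(\theta) \prod_{j=t-k+1}^{t-1} \left[A_j(\theta) - K_j(\theta)C_j(\theta)\right] K_{t-k}(\theta)
\end{aligned} \tag{5.7}$$

Here the factors of the matrix products are ordered with decreasing index $j$ from left
to right, and the empty product (for $k = 1$) is the identity matrix, as follows by
unrolling the recursion (5.5).

Similarly, we could start with a time-varying model like (4.84) and (4.85), where the
matrices $A$, $B$, $C$, $R_1$, $R_{12}$, and $R_2$ are functions of $t$. The corresponding predictor
will then be given by (4.94) and (4.95).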
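
The equivalence between the recursive predictor (5.5) and the weighting-function form (5.6) and (5.7) can be checked numerically. The Python sketch below uses NumPy and randomly generated time-varying matrices, all of which are hypothetical choices for illustration. It assumes a zero initial state at t = 0, so that the sums in (5.6) are truncated at k = t, and it takes the matrix product over j = t-1, ..., t-k+1 (decreasing from left to right), which is what unrolling (5.5) gives.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 2, 30
# Hypothetical time-varying matrices A_t, B_t, K_t, C_t (single input, single output)
A = [0.5 * np.eye(n) + 0.1 * rng.standard_normal((n, n)) for _ in range(N)]
B = [rng.standard_normal((n, 1)) for _ in range(N)]
K = [0.2 * rng.standard_normal((n, 1)) for _ in range(N)]
C = [rng.standard_normal((1, n)) for _ in range(N)]
u = rng.standard_normal(N)
y = rng.standard_normal(N)   # any recorded output sequence will do for this check

# Recursive predictor (5.5), started from xhat(0) = 0
xhat = np.zeros((n, 1))
yhat_rec = np.zeros(N)
for t in range(N):
    yhat_rec[t] = (C[t] @ xhat).item()
    xhat = (A[t] - K[t] @ C[t]) @ xhat + B[t] * u[t] + K[t] * y[t]

def weights(t, k):
    """Return (w_t^u(k), w_t^y(k)) of (5.7) for 1 <= k <= t."""
    Phi = np.eye(n)                            # empty product for k = 1
    for j in range(t - 1, t - k, -1):          # j = t-1, ..., t-k+1
        Phi = Phi @ (A[j] - K[j] @ C[j])
    return (C[t] @ Phi @ B[t - k]).item(), (C[t] @ Phi @ K[t - k]).item()

# Weighting-function form (5.6), truncated at k = t because of the zero initial state
yhat_sum = np.zeros(N)
for t in range(N):
    for k in range(1, t + 1):
        wu, wy = weights(t, k)
        yhat_sum[t] += wu * u[t - k] + wy * y[t - k]

max_diff = np.max(np.abs(yhat_rec - yhat_sum))
print("largest difference between the two forms:", max_diff)
```

The two predictions should agree to rounding error. The recursive form needs only the current state, while the explicit weights make the dependence of the prediction on each past data point visible.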

Two common problems associated with time-invariant systems in fact lead to
time-varying descriptions: nonequal sampling intervals and linearization. If the
system (4.62) and (4.63) is sampled at time instants $t = t_k$, $k = 1, 2, \ldots$, we can still
apply the sampling formulas (4.66) to (4.68) to go from $t_k$ to $t_{k+1}$, using $T_k = t_{k+1} - t_k$.
If this sampling interval is not constant, (4.67) will be a time-varying system. A related
case is multirate systems, i.e., when different variables are sampled at different rates.
Then the $C_t(\theta)$ matrix in (5.4) will be time varying in order to pick out the states
that are sampled at instant $t$. Missing output data can also be seen as nonuniform
sampling. See Section 14.2.


Linearization of Nonlinear Systems 


Perhaps the most common use of time-varying linear systems is related to linearization
of a nonlinear system around a certain trajectory. Suppose that a nonlinear
system is described by

$$\begin{aligned} x(t+1) &= f(x(t),u(t)) + r(x(t),u(t))\,w(t) \\ y(t) &= h(x(t)) + m(x(t),u(t))\,v(t) \end{aligned} \tag{5.8}$$

Suppose also that the disturbance terms $\{w(t)\}$ and $\{v(t)\}$ are white and small, and
that the nominal, disturbance-free ($w(t) \equiv 0$; $v(t) \equiv 0$) behavior of the system
corresponds to an input sequence $u^*(t)$ and corresponding trajectory $x^*(t)$. Neglecting
nonlinear terms, the differences

$$\begin{aligned} \Delta x(t) &= x(t) - x^*(t) \\ \Delta y(t) &= y(t) - h(x^*(t)) \\ \Delta u(t) &= u(t) - u^*(t) \end{aligned}$$

are then subject to

$$\begin{aligned} \Delta x(t+1) &= F(t)\Delta x(t) + G(t)\Delta u(t) + \bar{w}(t) \\ \Delta y(t) &= H(t)\Delta x(t) + \bar{v}(t) \end{aligned} \tag{5.9}$$

where

$$F(t) = \left.\frac{\partial}{\partial x} f(x,u)\right|_{x^*(t),u^*(t)}, \qquad G(t) = \left.\frac{\partial}{\partial u} f(x,u)\right|_{x^*(t),u^*(t)}, \qquad H(t) = \left.\frac{\partial}{\partial x} h(x)\right|_{x^*(t)}$$

Here we have neglected cross terms with the disturbance term (like $\Delta x \cdot v$), in view of
our assumption of small disturbances. In (5.9), $\bar{w}(t)$ and $\bar{v}(t)$ are white disturbances
with the following covariance properties:

$$\begin{aligned}
R_1(t) &= E\bar{w}(t)\bar{w}^T(t) = r(x^*(t),u^*(t))\,Ew(t)w^T(t)\,r^T(x^*(t),u^*(t)) \\
R_2(t) &= E\bar{v}(t)\bar{v}^T(t) = m(x^*(t),u^*(t))\,Ev(t)v^T(t)\,m^T(x^*(t),u^*(t)) \\
R_{12}(t) &= r(x^*(t),u^*(t))\,Ew(t)v^T(t)\,m^T(x^*(t),u^*(t))
\end{aligned} \tag{5.10}$$

This model is now a linear, time-varying, approximate description of (5.8) in a vicinity
of the nominal trajectory.
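
To make the linearization concrete, the following Python sketch compares a nonlinear simulation with the linear time-varying model (5.9) for a small input perturbation. Everything specific in it is an assumption made for illustration: the functions `f` and `h`, the nominal input, the perturbation size, and the use of central differences (helper `jacobian`) in place of analytical derivatives. The disturbances are set to zero.

```python
import numpy as np

def f(x, u):
    # Hypothetical nonlinear state update, chosen only for illustration
    return np.array([0.9 * x[0] + 0.1 * np.sin(x[1]) + 0.1 * u,
                     0.8 * x[1] + 0.05 * x[0] ** 2])

def h(x):
    # Hypothetical nonlinear output map
    return x[0] + 0.1 * x[1] ** 2

def jacobian(fun, z, eps=1e-6):
    """Central-difference Jacobian of fun at the point z (a 1-D array)."""
    z = np.asarray(z, dtype=float)
    cols = []
    for i in range(z.size):
        dz = np.zeros_like(z)
        dz[i] = eps
        cols.append((np.atleast_1d(fun(z + dz)) - np.atleast_1d(fun(z - dz))) / (2 * eps))
    return np.stack(cols, axis=1)

N = 50
u_nom = 0.5 * np.sin(0.2 * np.arange(N))      # nominal input u*(t)
x_nom = np.zeros((N + 1, 2))                  # nominal trajectory x*(t)
for t in range(N):
    x_nom[t + 1] = f(x_nom[t], u_nom[t])

du = 1e-3 * np.cos(0.3 * np.arange(N))        # small input perturbation
x = np.zeros(2)                               # perturbed nonlinear state
dx = np.zeros(2)                              # state of the linearized model (5.9)
err = 0.0
for t in range(N):
    F = jacobian(lambda z: f(z, u_nom[t]), x_nom[t])                  # F(t) = df/dx
    G = jacobian(lambda z: f(x_nom[t], z[0]), np.array([u_nom[t]]))   # G(t) = df/du
    H = jacobian(h, x_nom[t])                                         # H(t) = dh/dx
    dy_lin = (H @ dx).item()                  # output deviation from (5.9)
    dy_true = h(x) - h(x_nom[t])              # output deviation of the nonlinear system
    err = max(err, abs(dy_true - dy_lin))
    x = f(x, u_nom[t] + du[t])
    dx = F @ dx + G[:, 0] * du[t]

print("largest output mismatch:", err)
```

The mismatch is of second order in the perturbation, so it should shrink roughly by a factor of four when the perturbation is halved.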


5.2 MODELS WITH NONLINEARITIES 


A nonlinear relationship between the input sequence and the output sequence as in
(5.8) clearly gives much richer possibilities to describe systems. At the same time,
the situation is very flexible and it is quite demanding to infer general nonlinear
structures from finite data records. Even a first order model (5.8) ($\dim x = 1$) without
disturbances is specified only up to members in a general infinite-dimensional
function space (functions $f(\cdot,\cdot)$ and $h(\cdot)$), while the corresponding linear model
is characterized in terms of two real numbers. Various possibilities to parameterize
general mappings from past data to the predictor will be discussed in Section 5.4. It is,
however, always wise first to try and utilize physical insight into the character of
possible nonlinearities and construct suitable model structures that way. In this section
we shall deal with what might be called semi-physical modeling, by which we mean
using simple physical understanding to come up with the essential nonlinearities and
incorporate those in the models.


Wiener and Hammerstein Models 


It is a quite common situation that while the dynamics itself can be well described by
a linear system, there are static nonlinearities at the input and/or at the output. This
will be the case if the actuators are nonlinear, e.g., due to saturation, or if the sensors
have nonlinear characteristics. A model with a static nonlinearity at the input is called
a Hammerstein model, while we talk about a Wiener model if the nonlinearity is at
the output. See Figure 5.1. The combination will then be a Wiener-Hammerstein
model. The parameterization of such models is rather straightforward. Consider
the Hammerstein case. The static nonlinear function $f(\cdot)$ can be parameterized
either in terms of physical parameters, like saturation point and saturation level, or
in black-box terms like spline-function coefficients. This gives $f(\cdot, \eta)$. If the linear
model is $G(q, \theta)$, the predicted output model will be

$$\hat{y}(t|\theta, \eta) = G(q, \theta) f(u(t), \eta) \tag{5.11}$$

which is not much more complicated than the models of the previous chapter.

[Block diagrams. Above: u(t) → f → f(u(t)) → Linear Model → y(t). Below: u(t) → Linear Model → z(t) → f → y(t) = f(z(t)).]

Figure 5.1 Above: A Hammerstein model. Below: A Wiener model.


Linear Regression Models 


In (4.12) we defined a linear regression model structure where the prediction is linear
in the parameters:

$$\hat{y}(t|\theta) = \varphi^T(t)\theta \tag{5.12}$$

To describe a linear difference equation, the components of the vector $\varphi(t)$ (i.e., the
regressors) were chosen as lagged input and output values; see (4.11). When using
(5.12) it is, however, immaterial how $\varphi(t)$ is formed: what matters is that it is a known
quantity at time $t$. We can thus let it contain arbitrary transformations of measured
data. Let, as usual, $y^t$ and $u^t$ denote the output and input sequences up to time $t$.
Then we could write

$$\hat{y}(t|\theta) = \theta_1\varphi_1(u^t, y^{t-1}) + \cdots + \theta_d\varphi_d(u^t, y^{t-1}) = \varphi^T(t)\theta \tag{5.13}$$

with arbitrary functions $\varphi_i$ of past data. The structure (5.13) could be regarded as
a finite-dimensional parameterization of a general, unknown, nonlinear predictor.
The key is how to choose the functions $\varphi_i(u^t, y^{t-1})$. There are a few possibilities:


• Try black-box expansions. We could construct the regressors as typical (polynomial)
combinations of past inputs and outputs and see if the model is able
to describe the data. This will be contained in the treatment of Section 5.4 and
also corresponds to the so-called GMDH-approach (see Ivakhnenko. 196: 
and Farlow, 1984). It normally gives a large number of possible regressors. It
is somewhat simpler for the Hammerstein model, where we may approximate
the static nonlinearity by a polynomial expansion:

$$f(u) = \alpha_1 u + \alpha_2 u^2 + \cdots + \alpha_m u^m \tag{5.14}$$

Each power of $u$ could then pass a different numerator dynamics:

$$A(q)y(t) = B_1(q)u(t) + B_2(q)u^2(t) + \cdots + B_m(q)u^m(t) \tag{5.15}$$

This clearly is a linear regression model structure.

• Use physical insight. A few moments' reflection, using high school physics,
often may reveal which are the essential nonlinearities in a system. This will
suggest which regressors to try in (5.13). See Example 1.5. We call this semi-physical
modeling. It could be as simple as taking the product of voltage and
current measurements to create a power signal. See Example 5.1 below for
another simple case.
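
The Hammerstein structure (5.15) from the first item can be estimated by ordinary least squares, since it is linear in the parameters. The Python sketch below illustrates this with NumPy on noise-free simulated data. The first-order dynamics, the cubic nonlinearity, and all numerical values are hypothetical choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
u = rng.uniform(-1.0, 1.0, N)

# Hypothetical "true" system of the form (5.15) with A(q) = 1 + a q^-1 and B_i(q) = b_i q^-1:
# y(t) = -a*y(t-1) + b1*u(t-1) + b2*u(t-1)^2 + b3*u(t-1)^3
a, b = 0.7, np.array([1.0, 0.5, -0.3])
y = np.zeros(N)
for t in range(1, N):
    y[t] = -a * y[t - 1] + b @ np.array([u[t - 1], u[t - 1] ** 2, u[t - 1] ** 3])

# Regressors: the lagged output and the powers of the lagged input
Phi = np.column_stack([-y[:-1], u[:-1], u[:-1] ** 2, u[:-1] ** 3])
theta_hat, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
print("estimate of [a, b1, b2, b3]:", theta_hat)
```

With noise-free data the estimate should reproduce the values used in the simulation. With noisy data the same regression applies, but the usual considerations for least-squares estimates of ARX-type models then come into play.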


Example 5.1 A Solar-Heated House 


Consider the problem of identifying the dynamics of a solar-heated house, described
in Example 1.1. We need a model of how the storage temperature $y(t)$ is affected by
pump velocity $u(t)$ and solar intensity $I(t)$. A straightforward linear model of the type (4.7)
would be

$$y(t) + a_1 y(t-1) + a_2 y(t-2) = b_1 u(t-1) + b_2 u(t-2) + c_1 I(t-1) + c_2 I(t-2) \tag{5.16}$$

With this we have not used any physical insight into the heating process, but
introduced the black-box model (5.16) in an ad hoc manner. A moment's reflection
reveals that a linear model is not very realistic. Clearly, the effects of solar intensity
and pump velocity are not additive. When the pump is off, the sun does not at all
affect the storage temperature.

Let us go through what happens in the heating system. Introduce $x(t)$ for
the temperature of the solar panel collector at time $t$. With some simplifications,
the physics can be described as follows in discrete time: The heating of the air in
the collector [$= x(t+1) - x(t)$] is equal to heat supplied by the sun [$= d_2 \cdot I(t)$]
minus loss of heat to the environment [$= d_3 \cdot x(t)$] minus the heat transported to
storage [$= d_0 \cdot x(t) \cdot u(t)$]; that is,

$$x(t+1) - x(t) = d_2 I(t) - d_3 x(t) - d_0 x(t) \cdot u(t) \tag{5.17}$$

In the same way, the increase of storage temperature [$= y(t+1) - y(t)$] is equal to
supplied heat [$= d_0 x(t) \cdot u(t)$] minus losses to the environment [$= d_1 y(t)$]; that is,

$$y(t+1) - y(t) = d_0 x(t) u(t) - d_1 y(t) \tag{5.18}$$


In equations (5.17) and (5.18) the coefficients $d_i$ are unknown constants, whose
numerical values are to be determined. The temperature x(t) is not, however,
measured, so we first eliminate x(t) from (5.17) and (5.18). This gives

$$\begin{aligned}
y(t) = {} & (1 - d_1)\, y(t-1) + (1 - d_3)\, \frac{y(t-1)\, u(t-1)}{u(t-2)} \\
& + (d_3 - 1)(1 - d_1)\, \frac{y(t-2)\, u(t-1)}{u(t-2)} + d_0 d_2\, u(t-1)\, I(t-2) \\
& - d_0\, u(t-1)\, y(t-1) + d_0 (1 - d_1)\, u(t-1)\, y(t-2)
\end{aligned} \tag{5.19}$$

The relationship between the measured quantities y, u, and I and the parameters
$d_i$ is now more complicated. It can be simplified by reparametrization:


$$\begin{aligned}
\theta_1 &= (1 - d_1), & \varphi_1(t) &= y(t-1) \\
\theta_2 &= (1 - d_3), & \varphi_2(t) &= \frac{y(t-1)\, u(t-1)}{u(t-2)} \\
\theta_3 &= (d_3 - 1)(1 - d_1), & \varphi_3(t) &= \frac{y(t-2)\, u(t-1)}{u(t-2)} \\
\theta_4 &= d_0 d_2, & \varphi_4(t) &= u(t-1)\, I(t-2) \\
\theta_5 &= -d_0, & \varphi_5(t) &= u(t-1)\, y(t-1) \\
\theta_6 &= d_0 (1 - d_1), & \varphi_6(t) &= u(t-1)\, y(t-2) \\
\theta^T &= [\theta_1 \; \theta_2 \; \ldots \; \theta_6], & \varphi^T(t) &= [\varphi_1(t) \; \varphi_2(t) \; \ldots \; \varphi_6(t)]
\end{aligned} \tag{5.20}$$

Then (5.19) can be rewritten as a true linear regression,

$$y(t) = \varphi^T(t)\, \theta \tag{5.21}$$

where we have a linear relationship between the new parameters $\theta$ and the
constructed measurements $\varphi(t)$. (Notice that $\varphi$ does not depend on $\theta$.) The price for
this is that the knowledge of algebraic relationships between the $\theta_i$, according to
(5.20), has been lost. The simple structure (5.21) turns out to give a reasonably good
model, Ljung and Glad (1994a).
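
The reparametrization can be checked numerically. The following is a minimal sketch, not part of the original example: it simulates (5.17)-(5.18) for arbitrarily chosen coefficients and input signals (all numerical values and the function names `simulate`, `regressors`, and `theta_from_d` are illustrative assumptions), forms the constructed measurements of (5.20), and evaluates how closely $y(t) = \varphi^T(t)\theta$ holds.

```python
import math

def simulate(d0, d1, d2, d3, u, I):
    """Simulate (5.17)-(5.18) from zero initial temperatures; returns y."""
    x, y = [0.0], [0.0]
    for t in range(len(u) - 1):
        transported = d0 * x[t] * u[t]          # heat moved from collector to storage
        x.append(x[t] + d2 * I[t] - d3 * x[t] - transported)
        y.append(y[t] + transported - d1 * y[t])
    return y

def regressors(y, u, I, t):
    """Constructed measurements phi_1(t)..phi_6(t) of (5.20)."""
    ratio = u[t - 1] / u[t - 2]                 # requires u(t-2) != 0
    return [y[t - 1], y[t - 1] * ratio, y[t - 2] * ratio,
            u[t - 1] * I[t - 2], u[t - 1] * y[t - 1], u[t - 1] * y[t - 2]]

def theta_from_d(d0, d1, d2, d3):
    """New parameters theta_1..theta_6 of (5.20)."""
    return [1 - d1, 1 - d3, (d3 - 1) * (1 - d1), d0 * d2, -d0, d0 * (1 - d1)]

# Hypothetical coefficients and signals, chosen only for illustration.
d = (0.05, 0.02, 0.3, 0.1)
N = 200
u = [1.0 + 0.5 * math.sin(0.3 * t) for t in range(N)]   # pump velocity, never zero
I = [2.0 + math.cos(0.1 * t) for t in range(N)]         # solar intensity
y = simulate(*d, u, I)
theta = theta_from_d(*d)

# Largest deviation between y(t) and phi(t)^T theta over the record.
max_residual = max(
    abs(y[t] - sum(p * th for p, th in zip(regressors(y, u, I, t), theta)))
    for t in range(2, N)
)
print(max_residual)
```

In an identification setting the six components of $\theta$ would be estimated freely by least squares, which is exactly where the algebraic relationships in (5.20) are given up.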


5.3 NONLINEAR STATE-SPACE MODELS 


A General Model Set 
The most general description of a finite-dimensional system is 


$$\begin{aligned}
x(t+1) &= f(t, x(t), u(t), w(t); \theta) \\
y(t) &= h(t, x(t), u(t), v(t); \theta)
\end{aligned} \tag{5.22}$$

Here w(t) and v(t) are sequences of independent random variables and $\theta$ denotes
a vector of unknown parameters. The problem to determine a predictor based on
(5.22) and on formal probabilistic grounds is substantial. In fact, this nonlinear
prediction problem is known to have no finite-dimensional solution except in some
isolated special cases.

Nevertheless, predictors for (5.22) can of course be constructed, either with
ad hoc approaches or by some approximation of the unrealizable optimal solution.
For the latter problem, there is abundant literature (see, e.g., Jazwinski, 1970, or
Anderson and Moore, 1979). In either case the resulting predictor takes the form

$$\hat{y}(t|\theta) = g(t, Z^{t-1}; \theta) \tag{5.23}$$

Here, for easier notation, we introduced

$$Z^t = (y^t, u^t) = (y(1), u(1), \ldots, y(t), u(t))$$

to denote the input-output measurements available at time t. This is the form in
which the model is put to use for identification purposes. We may thus view (5.23) as 
the basic model, and disregard the route that took us from the underlying description 
[like (5.22)] to the form (5.23). This is consistent with the view of models as predictors 
that we took in Chapter 4: the only difference is that (5.23) is a nonlinear function
of past data, rather than a linear one. Just as in Chapter 4, the model (5.23) can be
complemented with assumptions about the associated prediction error

$$\varepsilon(t, \theta) = y(t) - g(t, Z^{t-1}; \theta) \tag{5.24}$$

such as its covariance matrix, $\Lambda(t; \theta)$, or its PDF, $f_e(x, t; \theta)$.


Nonlinear Simulation Model 


A particularly simple way of deriving a predictor from (5.22) is to disregard the 
process noise w(t) and take 


$$\begin{aligned}
x(t+1, \theta) &= f(t, x(t, \theta), u(t), 0; \theta) \\
\hat{y}(t|\theta) &= h(t, x(t, \theta), u(t), 0; \theta)
\end{aligned} \tag{5.25}$$

We call such a predictor a simulation model, since $\hat{y}(t|\theta)$ is constructed by simulating
a noise-free model (5.22) using the actual input. Clearly, a simulation model is almost
as easy to use starting from a continuous-time representation:

$$\begin{aligned}
\frac{d}{dt} x(t, \theta) &= f(t, x(t, \theta), u(t), 0; \theta) \\
\hat{y}(t|\theta) &= h(t, x(t, \theta), u(t), 0; \theta)
\end{aligned} \tag{5.26}$$

Example 5.2 Delignification 


Consider the problem of reducing the lignin content of wood chips in a chemical 
mixture. This is, basically, the process of cellulose cooking for obtaining pulp for 
paper making. 

Introduce the following notation: 


$x(t)$: lignin concentration at time t
$u_1(t)$: absolute temperature at time t
$u_2(t)$: concentration of hydrogen sulfite, $[\mathrm{HSO}_3^-]$
$u_3(t)$: concentration of hydrogen, $[\mathrm{H}^+]$


Then basic chemical laws tell us that

$$\frac{d}{dt} x(t) = -k_1 e^{-E_1/u_1(t)} [x(t)]^m \cdot [u_2(t)]^{\alpha} \cdot [u_3(t)]^{\beta} \tag{5.27}$$

Here $E_1$ is the Arrhenius constant and $k_1$, $m$, $\alpha$, and $\beta$ are other constants associated
with the reaction. Simulating (5.27) with the measured values of $\{u_i(t), i = 1, 2, 3\}$
for given values of $\theta^T = [E_1, k_1, m, \alpha, \beta]$ gives a sequence of corresponding lignin
concentrations $\{x(t, \theta)\}$. In this case the system output is also the lignin
concentration, so $\hat{y}(t|\theta) = x(t, \theta)$. These predicted, or simulated, values can then be
compared with the actually measured values so that the errors associated with a
particular value of $\theta$ can be evaluated. Such an application is described in detail in
Hagberg and Schöön (1974). □
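
As a rough illustration of how such a simulation model is put to use, the sketch below integrates (5.27) with a forward-Euler step and compares the simulated concentrations with a measured record through a sum of squared errors. The Euler scheme, the step size, all numerical values, and the function names are assumptions made for this sketch; they are not taken from the cited application.

```python
import math

def simulate_lignin(theta, u1, u2, u3, x0, dt):
    """Forward-Euler simulation of (5.27); returns x(t, theta) on the sample grid."""
    E1, k1, m, alpha, beta = theta
    x = [x0]
    for temp, hso3, hplus in zip(u1, u2, u3):
        rate = -k1 * math.exp(-E1 / temp) * x[-1] ** m * hso3 ** alpha * hplus ** beta
        x.append(max(x[-1] + dt * rate, 0.0))   # a concentration cannot become negative
    return x

def sum_squared_error(theta, u1, u2, u3, x_measured, dt):
    """Fit criterion: compare simulated and measured lignin concentrations."""
    x_sim = simulate_lignin(theta, u1, u2, u3, x_measured[0], dt)
    return sum((xm - xs) ** 2 for xm, xs in zip(x_measured, x_sim))

# Hypothetical operating conditions and parameter values.
N, dt = 50, 1.0
u1 = [420.0] * N      # absolute temperature
u2 = [0.5] * N        # hydrogen sulfite concentration
u3 = [0.1] * N        # hydrogen concentration
theta_true = (5000.0, 2000.0, 1.0, 0.5, 0.3)
x_measured = simulate_lignin(theta_true, u1, u2, u3, x0=1.0, dt=dt)   # stand-in for data

sse_true = sum_squared_error(theta_true, u1, u2, u3, x_measured, dt)
sse_other = sum_squared_error((5000.0, 1000.0, 1.0, 0.5, 0.3), u1, u2, u3, x_measured, dt)
print(sse_true, sse_other)
```

A parameter estimate would be obtained by minimizing `sum_squared_error` over $\theta$ with a numerical search routine.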


5.4 NONLINEAR BLACK-BOX MODELS: BASIC PRINCIPLES 


As we have seen before, a model is a mapping from past data to the space of the
output. In the nonlinear case, this mapping has the general structure (5.23):

$$\hat{y}(t|\theta) = g(Z^{t-1}, \theta) \tag{5.28}$$

We here omit the explicit time dependence. If we have no particular insight into the
system's properties, we should seek parameterizations of g that are flexible and cover
“all kinds of reasonable system behavior.” This would give us a nonlinear black-box
model structure, a nonlinear counterpart of the linear black-box models discussed in
Section 4.2. How to achieve such parameterizations is the topic of the present and
following sections.


A Structure for the General Mapping: Regressors 


Now, the model structure family (5.28) is really too general, and it is useful to write
g as a concatenation of two mappings: one that takes the increasing number of
past observations $Z^{t-1}$ and maps them into a finite-dimensional vector $\varphi(t)$ of fixed
dimension, and one that takes this vector to the space of the outputs:

$$g(Z^{t-1}, \theta) = g(\varphi(t), \theta) \tag{5.29}$$

where

$$\varphi(t) = \varphi(Z^{t-1}) \tag{5.30}$$

We shall as before call this vector $\varphi$ the regression vector and its components will
be referred to as regressors. We may allow also the more general case that the
formation of the regressors is itself parameterized

$$\varphi(t) = \varphi(Z^{t-1}, \theta) \tag{5.31}$$

which we for short write $\varphi(t, \theta)$; see, e.g., (4.40). For simplicity, the extra argument
$\theta$ will however be used explicitly only when essential for the discussion.

The choice of the nonlinear mapping in (5.28) has thus been decomposed into 
two partial problems for dynamical systems: 


1. How to choose the regression vector $\varphi(t)$ from past inputs and outputs.


2. How to choose the nonlinear mapping $g(\varphi, \theta)$ from the regressor space to the
output space.

The choice of regressors is of course application dependent. Typically the regressors
are chosen in the same way as for linear models: past measurements and possibly
past model outputs and past prediction errors, as in (4.11), (4.20), and (4.40). We
would then, analogously to Table 4.1, talk about NARX, NARMAX, NOE model
structures, etc. For the nonlinear black-box case, it is most common to use only
measured (not estimated) quantities in the regressors, i.e., NFIR and NARX.
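
For concreteness, a regression vector built only from measured quantities might be formed as in the sketch below. The function names and the orders `na` and `nb` are illustrative assumptions; the data are stored as Python lists indexed by time.

```python
def narx_regressor(y, u, t, na, nb):
    """phi(t) = [y(t-1), ..., y(t-na), u(t-1), ..., u(t-nb)], measured data only."""
    if t < max(na, nb):
        raise ValueError("not enough past data at time t")
    past_outputs = [y[t - i] for i in range(1, na + 1)]
    past_inputs = [u[t - i] for i in range(1, nb + 1)]
    return past_outputs + past_inputs

def nfir_regressor(u, t, nb):
    """NFIR case: past inputs only, i.e. a NARX regressor with na = 0."""
    return narx_regressor([], u, t, 0, nb)

y = [0.0, 1.0, 2.0, 3.0, 4.0]
u = [10.0, 11.0, 12.0, 13.0, 14.0]
print(narx_regressor(y, u, t=4, na=2, nb=3))   # [3.0, 2.0, 13.0, 12.0, 11.0]
print(nfir_regressor(u, t=4, nb=2))            # [13.0, 12.0]
```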


Basic Features of Function Expansions and Basis Functions 


Now let us turn to the nonlinear mapping $g(\varphi, \theta)$, which for any given $\theta$ maps $R^d$
to $R^p$. In the sequel, for simplicity we shall take p = 1, i.e., we deal only with
scalar outputs. At this point it does not matter how the regression vector
$\varphi = [\varphi_1 \; \ldots \; \varphi_d]^T$ is constructed. It is just a vector that lives in $R^d$.

It is natural to think of the parameterized function family as function expansions:

$$g(\varphi, \theta) = \sum_{k=1}^{n} \alpha_k g_k(\varphi), \qquad \theta = [\alpha_1 \; \ldots \; \alpha_n]^T \tag{5.32}$$

We refer to $g_k$ as basis functions, since the role they play in (5.32) is similar to that
of a functional space basis. We are going to show that the expansion (5.32), with
different basis functions, plays the role of a unified framework for investigating most
known nonlinear black-box model structures.

Now, the key question is: How to choose the basis functions $g_k$? An immediate
choice might be to try Taylor expansion

$$g_k(\varphi) = \varphi^k \tag{5.33}$$

where for d > 1 we would have to interpret $\varphi^k$ as the variables that cover all products
of the components $\varphi_j$ with summed exponents being k. Such expansions are called
Volterra expansions in the modeling application, and have long been tried.

The expansions that have attracted most of the recent interest are however of
a different kind, and the following facts are essential to understand the connections
between most known nonlinear black-box model structures:


• All the $g_k$ are formed from one “mother basis function” that we generically
denote by $\kappa(x)$.

• This function $\kappa(x)$ is a function of a scalar variable x.

• Typically $g_k$ are dilated (scaled) and translated versions of $\kappa$. For the scalar
case d = 1 we may write

$$g_k(\varphi) = g_k(\varphi, \beta_k, \gamma_k) = \kappa(\beta_k(\varphi - \gamma_k)) \tag{5.34}$$

We thus use $\beta_k$ to denote the dilation parameters and $\gamma_k$ to denote translation
parameters.


A scalar example: Fourier series. Take $\kappa(x) = \cos(x)$. Then (5.32), (5.34) will be
the Fourier series expansion, with $\beta_k$ as the frequencies and $\gamma_k$ as the phases.
Another scalar example: Piece-wise constant functions. Take $\kappa$ as the unit interval
indicator function

$$\kappa(x) = \begin{cases} 1 & \text{for } 0 \le x < 1 \\ 0 & \text{else} \end{cases} \tag{5.35}$$

and take, for example, $\gamma_k = k\Delta$, $\beta_k = 1/\Delta$, and $\alpha_k = f(k\Delta)$. Then (5.32), (5.34)
gives a piece-wise constant approximation of any function f over intervals of length
$\Delta$. Clearly we would have obtained a quite similar result by a smooth version of the
indicator function, e.g., the Gaussian bell:

$$\kappa(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \tag{5.36}$$


A variant of the piece-wise constant case. Take $\kappa$ to be the unit step function

$$\kappa(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 & \text{for } x \ge 0 \end{cases} \tag{5.37}$$

We then just have a variant of (5.35), since the indicator function can be obtained as
the difference of two steps. A smooth version of the step, like the sigmoid function

$$\kappa(x) = \sigma(x) = \frac{1}{1 + e^{-x}} \tag{5.38}$$

will of course give quite similar results.


Classification of Single-Variable Basis Functions


Two classes of single-variable basis functions can be distinguished depending on their
nature:


• Local Basis Functions are functions where the significant variation takes place
in a local environment.


• Global Basis Functions are functions that have significant variation over the
whole real axis.


Clearly the Fourier series and the Volterra expansions are examples of a global basis 
function, while (5.35)-(5.38) are all local functions. 


Construction of Multi-Variable Basis Functions


In the multi-dimensional case (d > 1), $g_k$ is a function of several variables. In
most nonlinear black-box models, it is constructed from the single-variable function
$\kappa$ in some simple manner. There are three common methods for constructing
multi-variable basis functions from single-variable basis functions.


1. Tensor product. The tensor product is obtained by taking the product of the
single-variable function, applied to each component of $\varphi$. This means that the
basis functions are constructed from the scalar function $\kappa$ as

$$g_k(\varphi) = g_k(\varphi, \beta_k, \gamma_k) = \prod_{j=1}^{d} \kappa(\beta_k^j(\varphi_j - \gamma_k^j)) \tag{5.39}$$

2. Radial construction. The idea is to let the value of the function depend only
on $\varphi$'s distance from a given center point

$$g_k(\varphi) = g_k(\varphi, \beta_k, \gamma_k) = \kappa(\|\varphi - \gamma_k\|_{\beta_k}) \tag{5.40}$$

where $\|\cdot\|_{\beta_k}$ denotes any chosen norm on the space of the regression vector
$\varphi$. The norm could typically be a quadratic norm

$$\|\varphi\|^2_{\beta_k} = \varphi^T \beta_k \varphi \tag{5.41}$$

with $\beta_k$ as a positive definite matrix of dilation (scale) parameters. In simple
cases $\beta_k$ may be just scaled versions of the identity matrix.


3. Ridge construction. The idea is to let the function value depend only on $\varphi$'s
distance from a given hyperplane:

$$g_k(\varphi) = g_k(\varphi, \beta_k, \gamma_k) = \kappa(\beta_k^T \varphi + \gamma_k), \qquad \varphi \in R^d \tag{5.42}$$

The ridge function is thus constant for all $\varphi$ in the hyperplane
$\{\varphi \in R^d : \beta_k^T \varphi = \text{constant}\}$. As a consequence, even if the mother basis function $\kappa$
has local support, the basis functions $g_k$ will have unbounded support in this
subspace.
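
The three constructions can be written out compactly. The sketch below is illustrative only: it takes the Gaussian bell (5.36) as mother function, restricts the radial norm (5.41) to a diagonal $\beta_k$ (an assumption made to keep the code short), and represents vectors as plain Python lists. The function names are not from the text.

```python
import math

def gaussian(x):
    """Gaussian bell (5.36), used here as the mother basis function kappa."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def tensor_basis(kappa, phi, beta, gamma):
    """(5.39): product of kappa applied to each scaled and shifted component."""
    value = 1.0
    for p, b, g in zip(phi, beta, gamma):
        value *= kappa(b * (p - g))
    return value

def radial_basis(kappa, phi, beta, gamma):
    """(5.40)-(5.41) with diagonal beta: kappa of the weighted distance to gamma."""
    dist2 = sum(b * (p - g) ** 2 for p, b, g in zip(phi, beta, gamma))
    return kappa(math.sqrt(dist2))

def ridge_basis(kappa, phi, beta, gamma):
    """(5.42): kappa of beta^T phi + gamma, where gamma is a scalar."""
    return kappa(sum(b * p for b, p in zip(beta, phi)) + gamma)

center = [1.0, 2.0]
print(tensor_basis(gaussian, center, [1.0, 1.0], center))   # kappa(0) squared
print(radial_basis(gaussian, center, [1.0, 1.0], center))   # kappa(0)
# Two points on the same hyperplane beta^T phi = 3 give the same ridge value:
print(ridge_basis(gaussian, [1.0, 2.0], [1.0, 1.0], 0.0),
      ridge_basis(gaussian, [2.0, 1.0], [1.0, 1.0], 0.0))
```

The last line illustrates the unbounded support of the ridge construction: moving along the hyperplane does not change the function value, however local $\kappa$ is.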


Approximation issues 


For any of the described choices the resulting model becomes

$$g(\varphi, \theta) = \sum_{k=1}^{n} \alpha_k \kappa(\beta_k(\varphi - \gamma_k)) \tag{5.43}$$

with the different interpretations of the argument $\beta_k(\varphi - \gamma_k)$ just discussed. The
expansion is entirely determined by:


• The scalar valued function $\kappa(x)$ of a scalar variable x.
• The way the basis functions are expanded to depend on a vector $\varphi$.


The parameterization in terms of $\theta$ can be characterized by three types of parameters:


• The coordinates $\alpha$
• The scale or dilation parameters $\beta$
• The location parameters $\gamma$


These three parameter types affect the model in quite different ways. The coordinates 
enter linearly, which means that (5.43) is a linear regression for fixed scale and location 
parameters. 
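
To illustrate the linear-regression property, the sketch below fixes the scale and location parameters of three Gaussian-shaped basis functions and recovers the coordinates $\alpha$ by ordinary least squares from noise-free data. The normal-equations solver, the grid, and all numerical values are assumptions chosen for this sketch.

```python
import math

def bell(x):
    """Unnormalized Gaussian bell; the scaling of (5.36) is absorbed by alpha."""
    return math.exp(-0.5 * x * x)

def design_row(phi, betas, gammas):
    """Basis function values kappa(beta_k (phi - gamma_k)) for a scalar phi."""
    return [bell(b * (phi - g)) for b, g in zip(betas, gammas)]

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_coordinates(phis, ys, betas, gammas):
    """Least-squares alpha from the normal equations (G^T G) alpha = G^T y."""
    G = [design_row(p, betas, gammas) for p in phis]
    n = len(betas)
    GtG = [[sum(row[i] * row[j] for row in G) for j in range(n)] for i in range(n)]
    Gty = [sum(row[i] * yv for row, yv in zip(G, ys)) for i in range(n)]
    return solve(GtG, Gty)

# Fixed (hypothetical) scale and location parameters; only alpha is estimated.
betas, gammas = [2.0, 2.0, 2.0], [0.0, 1.0, 2.0]
alpha_true = [1.0, -2.0, 0.5]
phis = [-1.0 + 0.2 * i for i in range(21)]
ys = [sum(a * g for a, g in zip(alpha_true, design_row(p, betas, gammas))) for p in phis]
alpha_hat = fit_coordinates(phis, ys, betas, gammas)
print(alpha_hat)
```

If $\beta$ and $\gamma$ were also to be estimated, the problem would no longer be linear in the parameters and an iterative search would be needed.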

A key issue is how well the function expansion is capable of approximating any
possible “true system” $g_0(\varphi)$. There is rather extensive literature on this subject.
For an identification oriented survey, see, e.g., Juditsky et al. (1995).

The bottom line is easy: For almost any choice of $\kappa(x)$, except being a polynomial,
the expansion (5.43) can approximate any “reasonable” function $g_0(\varphi)$
arbitrarily well for sufficiently large n.

It is not difficult to understand this. It is sufficient to check that the delta
function (or the indicator function for arbitrarily small areas) can be arbitrarily
well approximated within the expansion. Then clearly all reasonable functions can
also be approximated. For a local $\kappa$ with radial construction this is immediate: It
is itself a delta function approximator. For the ridge construction this is somewhat
more tricky and has been proved by Cybenko (1989). Intuitively we may think of a
fixed $\gamma_k = \gamma$ and “many” hyperplanes $\beta_k$ that all intersect at $\gamma$ and are such that
$\kappa$ is nonzero only “close” to the hyperplane (norm of $\beta$ is large). Then most of the
support of the resulting function will be at $\gamma$.

The question of how efficient the expansion is, i.e., how large n is required
to achieve a certain degree of approximation, is more difficult, and has no general
answer. See, e.g., Barron (1993). We may point to the following aspects:


i. If the scale and location parameters $\beta$ and $\gamma$ are allowed to depend on the
function $g_0$ to be approximated, then the number of terms n required for a
certain degree of approximation is much less than if $\beta_k, \gamma_k$, $k = 1, \ldots$ is an a
priori fixed sequence. To realize this, consider the following simple example:


Example 5.3 Constant Functions 


Suppose we use piece-wise constant functions to approximate any scalar valued
function of a scalar variable $\varphi$:

$$g_0(\varphi), \qquad 0 \le \varphi \le B \tag{5.44}$$

$$g_n(\varphi) = \sum_{k=1}^{n} \alpha_k \kappa(\beta_k(\varphi - \gamma_k)) \tag{5.45}$$

where $\kappa$ is the unit indicator function (5.35). Let us suppose that the requirement
is $\sup_{\varphi} |g_0(\varphi) - g_n(\varphi)| \le \epsilon$ and we know a bound on the derivative of
$g_0$: $\sup |g_0'(\varphi)| \le C$. For (5.45) to be able to deliver such a good approximation for
any such function $g_0$ we need to take $\gamma_k = k\Delta$, $\beta_k = 1/\Delta$ with $\Delta \le 2\epsilon/C$, i.e., we
need $n \ge CB/(2\epsilon)$. That is, we need a fine grid $\Delta$ that is prepared for the worst
case $|g_0'(\varphi)| = C$ at any point.

If the actual function to be approximated turns out to be much smoother, and
has a large derivative only in a small interval, we can adapt the choice of $\beta_k = 1/\Delta_k$
so that $\Delta_k \approx 2\epsilon/|g_0'(\varphi^*)|$ for the interval around $\gamma_k = k\Delta_k \approx \varphi^*$, which may give
the desired degree of approximation with much fewer terms in the expansion. □


ii. For the local, radial approach the number of terms required to achieve a certain
degree of approximation $\delta$ of an s times differentiable function is proportional
to

$$n \sim \frac{1}{\delta^{(d/s)}}, \qquad \delta \ll 1 \tag{5.46}$$

See Problem 5G.2. It thus increases exponentially with the number of regressors.
This is often referred to as the curse of dimensionality.
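
Both aspects can be explored numerically. The sketch below first reproduces the uniform-grid count of Example 5.3 for $g_0 = \sin$ on $[0, 3]$ and checks the resulting worst-case error on a dense grid; it then evaluates the growth rate (5.46) for a few regressor dimensions. Taking $\alpha_k$ as the value of $g_0$ at the interval midpoint and setting the proportionality constant in (5.46) to one are assumptions of this sketch, as are all names and numbers.

```python
import math

def indicator(x):
    """Unit interval indicator (5.35)."""
    return 1.0 if 0.0 <= x < 1.0 else 0.0

def uniform_terms(C, B, eps):
    """Smallest n for which the grid size Delta = B / n satisfies Delta <= 2 eps / C."""
    return math.ceil(C * B / (2.0 * eps))

def piecewise_constant(g0, B, n):
    """Expansion (5.45) on a uniform grid; alpha_k is g0 at the interval midpoint."""
    delta = B / n
    alphas = [g0((k + 0.5) * delta) for k in range(n)]
    def g_n(phi):
        return sum(a * indicator((phi - k * delta) / delta) for k, a in enumerate(alphas))
    return g_n

def radial_terms(delta, d, s):
    """Growth rate (5.46) with the proportionality constant set to 1."""
    return (1.0 / delta) ** (d / s)

C, B, eps = 1.0, 3.0, 0.0625            # |d/dphi sin(phi)| <= 1 on [0, 3]
n = uniform_terms(C, B, eps)
g_n = piecewise_constant(math.sin, B, n)
worst = max(abs(math.sin(p) - g_n(p)) for p in (i * B / 3000 for i in range(3000)))
print(n, worst)

for d in (1, 2, 5, 10):
    print(d, radial_terms(0.1, d, s=1))  # grows as 10**d for delta = 0.1, s = 1
```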

Network Questions for Nonlinear Black-Box Structures (*)


So far we have viewed the model structures as basis function expansions, where
the basis functions may contain adjustable parameters. Such structures are often
referred to as networks, primarily since typically one “mother basis function” $\kappa$ is
repeated a large number of times in the expansion. Graphical illustrations of the
structure therefore look like networks.


Multi-Layer Networks. The network aspect of the function expansion is even more
pronounced if the basic mappings are convolved with each other in the following
manner: Let the outputs of the basis functions be denoted by

$$\varphi_k^{(2)}(t) = g_k(\varphi(t)) = \kappa(\varphi(t), \beta_k, \gamma_k)$$

and collect them into a vector:

$$\varphi^{(2)}(t) = [\varphi_1^{(2)}(t), \ldots, \varphi_n^{(2)}(t)]$$

Now, instead of taking a linear combination of these $\varphi_k^{(2)}(t)$ as the output of the
model (as in (5.32)), we could treat them as new regressors and insert them into
another “layer” of basis functions forming a second expansion

$$g(\varphi, \theta) = \sum_{l} \alpha_l^{(2)} \kappa(\varphi^{(2)}, \beta_l^{(2)}, \gamma_l^{(2)}) \tag{5.47}$$

where $\theta$ denotes the whole collection of involved parameters: $\beta_k$, $\gamma_k$, $\alpha_l^{(2)}$, $\beta_l^{(2)}$, $\gamma_l^{(2)}$.

Within neural network terminology, (5.47) is called a two-hidden-layer network.
The basis functions $\kappa(\varphi(t), \beta_k, \gamma_k)$ then constitute the first hidden layer, while
$\kappa(\varphi^{(2)}, \beta_l^{(2)}, \gamma_l^{(2)})$ give the second one. The layers are “hidden” because they do not
show up explicitly in the output $g(\varphi, \theta)$ in (5.47), but they are of course available to
the user. See Figure 5.2 for an illustration. Clearly we can repeat the procedure an
arbitrary number of times to produce multi-layer networks. This term is primarily
used for sigmoid neural networks, but applies as such to any basis function expansion
(5.32).
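
A two-hidden-layer expansion of the form (5.47) can be evaluated as in the following sketch. It assumes ridge units (5.42) with the sigmoid (5.38) in both layers; the weights and the function names are hypothetical and chosen so that the result is easy to check by hand.

```python
import math

def sigmoid(x):
    """Sigmoid mother function (5.38)."""
    return 1.0 / (1.0 + math.exp(-x))

def ridge_layer(inputs, betas, gammas):
    """One hidden layer: unit k outputs sigmoid(beta_k^T inputs + gamma_k)."""
    return [sigmoid(sum(b * v for b, v in zip(beta, inputs)) + gamma)
            for beta, gamma in zip(betas, gammas)]

def two_hidden_layer_net(phi, layer1, layer2, alphas2):
    """Evaluate (5.47): first-layer outputs act as regressors for the second layer."""
    phi2 = ridge_layer(phi, *layer1)        # the vector phi^(2)(t)
    hidden2 = ridge_layer(phi2, *layer2)    # second hidden layer
    return sum(a * h for a, h in zip(alphas2, hidden2))

# Hypothetical parameters: two first-layer units, one second-layer unit.
layer1 = ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0])   # (beta_k, gamma_k)
layer2 = ([[2.0, -2.0]], [0.0])                     # (beta_l^(2), gamma_l^(2))
alphas2 = [3.0]                                     # alpha_l^(2)
out = two_hidden_layer_net([1.0, 1.0], layer1, layer2, alphas2)
print(out)   # every unit sees argument 0 here, so the output is 3 * 0.5 = 1.5
```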


Figure 5.2 Feedforward network with two hidden layers (input layer, hidden layers, output layer).




The question of how many layers to use however is not easy. In principle, with
many basis functions, one hidden layer is sufficient for modeling most practically
reasonable systems. Sontag (1993) contains many useful and interesting insights into
the importance of second hidden layers in the nonlinear structure.


Recurrent Networks. Another very important concept for applications to dynamical
systems is that of recurrent networks. This refers to the situation that some of
the regressors used at time t are outputs from the model structure at previous time
instants:

$$\varphi_k(t) = g(\varphi(t - k), \theta)$$

See the illustration in Figure 5.3. It can also be the case that some component $\varphi_j(t)$
of the regressor at time t is obtained as a value from some interior node (not just at
the output layer) at a previous time instant. Such model dependent regressors make
the structure considerably more complex, but offer at the same time quite useful
flexibility. The regression vectors (4.20) and (4.40) are examples where previous
model outputs and other internal signals are used as regressors.
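
The recursion implied by model-dependent regressors can be sketched as follows. Here the regressor consists of past model outputs and the previous input, in the spirit of an output-error structure; the mapping `g_linear` is a deliberately trivial stand-in for a nonlinear $g$, and all names and numbers are illustrative assumptions.

```python
def simulate_recurrent(g, theta, u, n_back, y_init=0.0):
    """Run a recurrent predictor: past model outputs, not measurements, are regressors."""
    y_hat = [y_init] * n_back
    for t in range(n_back, len(u)):
        # phi(t) = [y_hat(t-1), ..., y_hat(t-n_back), u(t-1)]
        phi = [y_hat[t - k] for k in range(1, n_back + 1)] + [u[t - 1]]
        y_hat.append(g(phi, theta))
    return y_hat

def g_linear(phi, theta):
    """Placeholder mapping g(phi, theta); any basis function expansion could be used."""
    return sum(p * th for p, th in zip(phi, theta))

y_hat = simulate_recurrent(g_linear, theta=[0.5, 1.0], u=[1.0, 1.0, 1.0, 1.0], n_back=1)
print(y_hat)   # [0.0, 1.0, 1.5, 1.75]
```

Because each output depends on earlier outputs of the same model, the prediction at time t depends on $\theta$ through the whole past trajectory, which is what makes such structures harder to estimate.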


Figure 5.3 Example of a recurrent network. $q^{-1}$ delays the signal by one time
sample.


One might distinguish between input/output based networks and state-space
based networks, although the difference is less distinct in the nonlinear case. The
former would be using only past outputs from the network as recurrent regressors,
while the latter may feed back any interior point in the network to the input layer as a
recurrent regressor. Experience with state-space based networks is quite favorable,
e.g., Nerrand et al. (1993).


Estimation Aspects 


An important reason for dealing with these nonlinear black-box models in parallel
with other models is that the estimation theory, the asymptotic properties and the 
basic algorithms are the same as for the other model structures discussed in this book. 
We shall return to special features in algorithms and methods that are particular to 
nonlinear black boxes, but most of the discussion in Chapters 7 and onwards applies 
also to these structures. 


5.5 NONLINEAR BLACK-BOX MODELS: NEURAL NETWORKS, 
WAVELETS AND CLASSICAL MODELS 


In this section we shall briefly review some popular model structures. They all have
the general form of function expansions (5.43), and are composed of basis functions
$g_k$ obtained by parameterizing some particular “mother basis function” $\kappa$ as
described in the previous section.

Neural Networks


Neural networks have become a very popular choice of model structure in recent 
years. The name refers to certain structural similarities with the neural synapse 
system in animals. From our perspective these models correspond to certain choices 
in the general function expansion. 


Sigmoid Neural Networks. The combination of the model expansion (5.32) with a
ridge basis function (5.42) and the sigmoid choice (5.38) for mother function gives
the celebrated one-hidden-layer feedforward sigmoid neural net.


Wavelet and Radial Basis Networks. The combination of the Gaussian bell type
mother function (5.36) and the radial construction (5.40) is found in both wavelet
networks, Zhang and Benveniste (1992), and radial basis neural networks, Poggio
and Girosi (1990).


Wavelets 


Wavelet decomposition is a typical example for the use of local basis functions.
Loosely speaking, the “mother basis function” (usually referred to as mother wavelet
in the wavelet literature, and there denoted by $\psi$ rather than $\kappa$) is dilated and
translated to form a wavelet basis. In this context it is common to let the expansion (5.32)
be doubly indexed according to scale and location, and use the specific choices (for
the one-dimensional case) $\beta_j = 2^j$ and $\gamma_k = 2^{-j}k$. This gives, in our notation,

$$g_{j,k}(\varphi) = \kappa(2^j \varphi - k), \qquad j, k \text{ positive integers} \tag{5.48}$$

Compared to the simple example of a piece-wise constant function approximation
in (5.35), we have here multi-resolution capabilities, so that the intervals are
multiply covered using basis functions of different resolutions (i.e., different scale
parameters). With suitably chosen mother wavelet and appropriate translation and
dilation parameters, the wavelet basis can be made orthonormal, which makes it easy
to compute the coordinates $\alpha_{j,k}$ in (5.32).

The multi-variable wavelet functions can be constructed by tensor products of
scalar wavelet functions, but other constructions are also possible. However, the
burden in computation and storage of wavelet basis functions rapidly increases with
the regressor dimension d, so the use of these orthonormal expansions is in practice
limited to the case of few regressors ($d \le 3$).


“Classical” Nonlinear Black-Box Models 


Kernel Estimators. Another well-known example of the use of local basis functions
is kernel estimators, Nadaraya (1964), Watson (1969). A kernel function $\kappa(\cdot)$ is
typically a bell-shaped function, and the kernel estimator has the form

$$g(\varphi) = \sum_{k=1}^{n} \alpha_k \kappa\left(\frac{\varphi - \gamma_k}{h}\right) \tag{5.49}$$

where h is a small positive number and $\gamma_k$ are given points in the space of the regression
vector $\varphi$. This clearly is a special case of (5.32), (5.34), with one fixed scale parameter
$h = 1/\beta$ for all the basis functions. This scale parameter is typically tuned to the
problem, though. A common choice of $\kappa$ in this case is the Epanechnikov kernel:

$$\kappa(x) = \begin{cases} 1 - x^2 & \text{for } |x| < 1 \\ 0 & \text{for } |x| \ge 1 \end{cases} \tag{5.50}$$

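
A kernel estimate with the Epanechnikov kernel might be computed as in the sketch below. It uses the common Nadaraya-Watson normalization, in which the coordinate of the k-th term is $y_k$ divided by the sum of all kernel weights at the query point; that normalization, the function names, and the data are assumptions of this sketch rather than something stated in (5.49).

```python
def epanechnikov(x):
    """Epanechnikov kernel: 1 - x^2 inside the unit interval, 0 outside."""
    return 1.0 - x * x if abs(x) < 1.0 else 0.0

def kernel_estimate(phi, gammas, ys, h):
    """Kernel-weighted average of the observed y_k located at the points gamma_k."""
    weights = [epanechnikov((phi - g) / h) for g in gammas]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no data point within bandwidth h of phi")
    return sum(w * yk for w, yk in zip(weights, ys)) / total

gammas = [0.0, 1.0, 2.0]      # observed regressor values (scalar case)
ys = [0.0, 1.0, 2.0]          # corresponding outputs
print(kernel_estimate(1.0, gammas, ys, h=1.5))   # symmetric weights give 1.0
```

The bandwidth h plays the role of the tuning knob mentioned above: a small h uses only very close data points, a large h averages over many.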
Nearest Neighbors or Interpolation. The nearest neighbor approach to identification
has a strong intuitive appeal: When encountered with a new value $\varphi^*$ we look
into the past data $Z^N$ to find the regression vector $\varphi(t)$ that is closest to $\varphi^*$.
We then associate the regressor $\varphi^*$ with the measurement y(t) corresponding to
that $\varphi(t)$.

This corresponds to using (5.43) with $\kappa$ as the indicator function (5.35),
expanded to a hypercube by the radial approach (5.40). The location and scale
parameters in (5.40) are chosen such that the cubes $\kappa(\|\varphi - \gamma_k\|_{\beta_k})$ are tightly laid such that
exactly one data point $\varphi(k) \in Z^N$ falls at the center of each cube ($\gamma_k = \varphi(k)$). The
corresponding coordinate $\alpha_k$ will be the value of y(k) at this data point. The number
of terms in the expansion will then equal the number of data points in $Z^N$. The value
of $g(\varphi^*, \theta)$ will now equal $\alpha_t = y(t)$ for that t for which $\kappa(\beta_t(\varphi^* - \gamma_t)) = 1$, i.e.,
for that $\varphi(t)$ which is closest to $\varphi^*$. That gives the nearest neighbor approach.
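
In code, the nearest neighbor predictor amounts to a search over the stored regressors. The sketch below measures closeness with the Euclidean distance, which is an assumption; the hypercube picture above would correspond to a maximum norm instead. The function name and the data are hypothetical.

```python
def nearest_neighbor_predict(phi_star, phis, ys):
    """Return y(t) for the stored regressor phi(t) closest to phi_star."""
    def dist2(phi):
        return sum((a - b) ** 2 for a, b in zip(phi, phi_star))
    t_best = min(range(len(phis)), key=lambda t: dist2(phis[t]))
    return ys[t_best]

phis = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]   # stored regression vectors
ys = [10.0, 20.0, 30.0]                       # corresponding measurements
print(nearest_neighbor_predict([0.9, 0.8], phis, ys))   # 20.0
```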


B-Splines. B-splines are local basis functions which are piece-wise polynomials.
The connections of the pieces of polynomials have continuous derivatives up to a
certain order, depending on the degree of the polynomials, De Boor (1978),
Schumaker (1981). Splines are very nice functions, since they are computationally very
simple and can be made as smooth as desired. For these reasons, they have been
widely used in classic interpolation problems.


5.6 FUZZY MODELS 


For complex systems it may be difficult to set up precise mathematical models. This
has led to so-called fuzzy models, which are based on verbal and imprecise
descriptions of the relationships between the measured signals in a system, Zadeh (1975).
The fuzzy models typically consist of so-called rule bases, but can be cast exactly into
the framework of model structures of the class (5.32). In this case, the basis functions
$g_k$ are constructed from the fuzzy set membership functions and inference rules for
combining fuzzy rules and for how to “defuzzify” the results. When the fuzzy models
contain parameters to be adjusted, they are also called neuro-fuzzy models, Jang
and Sun (1995). In this section, we shall give a simplified description of how this
works.


Fuzzy Rule Bases as Models 


Fuzzy Rules. A fuzzy rule base is a collection of statements about some relationship
between measured variables, like

• If it is cold outdoors and the heater has low power, then
the room will be cold


• If it is cold outdoors and the heater has high power, then
room temperature will be normal


• If it is hot outdoors and the heater has high power, then
the room will be hot


We can think of this as a verbal model relating the “regressors,” $\varphi_1$ = outdoor
temperature and $\varphi_2$ = heater power, to the variable to be explained, y = room temperature.
The question then is, what does this tell us when we know that the outdoor
temperature is 7°C and the heater is set to 100 W? To handle that, we need to:


1. Quantify what “cold outdoors” means, related to a particular temperature read- 
ing, and similarly for the heater power. 


2. Decide which rules are applicable, and how to combine their conclusions to 
come up with a statement about the room temperature. 


We shall now discuss each of these questions. 


Fuzzy Sets and Membership Functions. Each of the regressors that is used in the
rule base is associated with a number of attributes, like for the outdoor temperature,
“cold,” “nice,” “hot.” These attributes are seen as sets, to which the temperature can
belong “to a certain degree.” This is the fuzzy nature of the sets. Each attribute
(each fuzzy set) is associated with a membership function that, for a particular value
of the variable, describes the “degree of membership” to this set. If the attribute
for the variable $\varphi$ is denoted by A, the membership function is usually denoted by
$\mu_A(\varphi)$, and will be a function assuming values between 0 and 1. Typical membership
functions are built up by piece-wise linear functions like a “soft step”

$$\kappa_1(x) = \begin{cases} 1 & \text{for } x < -1 \\ 1 - (x + 1)/2 & \text{for } -1 \le x \le 1 \\ 0 & \text{for } 1 < x \end{cases} \tag{5.51}$$

or a triangle

$$\kappa_2(x) = \begin{cases} 0 & \text{for } |x| \ge 1 \\ 1 - |x| & \text{for } |x| < 1 \end{cases} \tag{5.52}$$

Clearly by scale and location parameters, the step and the peak can be placed
anywhere and be of arbitrary width:

$$\mu_A(\varphi) = \kappa_i(\beta(\varphi - \gamma)) \tag{5.53}$$

We shall here only deal with attributes and membership functions such that the
degrees of membership to the different attributes associated with a given variable
always sum up to one. That is, if $\varphi$ has r attributes $A_i$, $i = 1, \ldots, r$, then the
membership functions should be subject to

$$\sum_{i=1}^{r} \mu_{A_i}(\varphi) = 1, \qquad \forall \varphi \tag{5.54}$$

Such a collection of sets and membership functions is called a strong fuzzy partition.


For the outdoor temperature, e.g., we could associate the attribute “cold” with
the membership function $\mu_{\text{cold}}(\varphi) = \kappa_1(0.2(\varphi - 10))$, that is, any temperature below
5°C is definitely “cold,” and any temperature above 15°C is definitely “not cold,”
while temperatures in between are cold to varying degrees. Similarly, we could have
$\mu_{\text{nice}}(\varphi) = \kappa_2(0.1(\varphi - 15))$ and $\mu_{\text{hot}}(\varphi) = \kappa_1(-0.2(\varphi - 20))$. This is illustrated in
Figure 5.4. It is clearly a strong fuzzy partition.
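
That the three membership functions form a strong fuzzy partition can be confirmed numerically. The sketch below implements the soft step and the triangle as given in (5.51) and (5.52) and evaluates the sum of memberships on a temperature grid; the grid and the function names are illustrative choices.

```python
def soft_step(x):
    """kappa_1 of (5.51): 1 below -1, 0 above 1, linear in between."""
    if x < -1.0:
        return 1.0
    if x > 1.0:
        return 0.0
    return 1.0 - (x + 1.0) / 2.0

def triangle(x):
    """kappa_2 of (5.52): peak 1 at x = 0, zero for |x| >= 1."""
    return max(0.0, 1.0 - abs(x))

def mu_cold(temp):
    return soft_step(0.2 * (temp - 10.0))

def mu_nice(temp):
    return triangle(0.1 * (temp - 15.0))

def mu_hot(temp):
    return soft_step(-0.2 * (temp - 20.0))

temps = [0.5 * i for i in range(-20, 81)]        # -10 to 40 degrees C
worst = max(abs(mu_cold(T) + mu_nice(T) + mu_hot(T) - 1.0) for T in temps)
print(worst)                                     # deviation from (5.54), rounding only
print(mu_cold(7.0), mu_nice(7.0), mu_hot(7.0))   # approximately 0.8, 0.2, 0.0
```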


Figure 5.4 Membership functions for cold (solid line), nice (dashed line), and
hot (dotted line).


Combining Rules. We can write down a fuzzy rule base, relating a regression vector
$\varphi$ to an output variable y, more formally as

$$\begin{aligned}
&\text{if } (\varphi_1 \text{ is } A_{1,1}) \text{ and } \ldots \text{ and } (\varphi_d \text{ is } A_{1,d}) \text{ then } (y \text{ is } B_1) \\
&\qquad \vdots \\
&\text{if } (\varphi_1 \text{ is } A_{p,1}) \text{ and } \ldots \text{ and } (\varphi_d \text{ is } A_{p,d}) \text{ then } (y \text{ is } B_p)
\end{aligned} \tag{5.55}$$

where the fuzzy sets $A_{j,i}$ are doubly indexed: i is the index of the regressor variable
(measurement), and j is the index of the rule. We denote the membership functions
by $\mu_{A_{j,i}}(\varphi_i)$ and $\mu_{B_j}(y)$, respectively.

We shall assume that the rule base is such that there is a rule that covers each
possible combination of attributes for the different regressors. See Example 5.4
below. This means in particular that the number of rules, p, must be equal to the
product of the number of attributes associated with each regressor. Such a rule base
is called complete. This means in particular that

$$\sum_{j=1}^{p} \prod_{i=1}^{d} \mu_{A_{j,i}}(\varphi_i) = 1 \tag{5.56}$$

(see Problem 5E.3). The joint membership function for all the conditions of rule
number j could rather naturally be taken as the product of the individual ones:

$$\mu_{A_j}(\varphi) = \prod_{i=1}^{d} \mu_{A_{j,i}}(\varphi_i) \tag{5.57}$$

This number could be seen as a “measure of the applicability” of rule number j
for the given value of $\varphi$. With this rule we could associate a numerical value of y
associated with the conclusion $B_j$ of rule j. This number (let it be denoted by $\alpha_j$)
could be the center of mass of the membership function $\mu_{B_j}$ or the value for which
the membership function has its peak. In any case it is natural to associate the value

$$\hat{y} = \sum_{j=1}^{p} \alpha_j \mu_{A_j}(\varphi) \tag{5.58}$$

with the regressors $\varphi$. In view of (5.56) and (5.57), this is a weighted mean of the
“average conclusion” of each of the rules, weighted by the applicability of the rule.


Example 5.4 A DC-Motor 


Consider an electric motor with input voltage $u$ and output angular velocity $y$. We
would like to explain how the angular velocity at time $t$, i.e., $y(t)$, depends on the
applied voltage $u(t-1)$ and the velocity at the previous time sample. That is,
we are using the regressors $\varphi(t) = [\varphi_1(t), \varphi_2(t)]^T$, where $\varphi_1(t) = u(t-1)$ and
$\varphi_2(t) = y(t-1)$. Let us now devise a rule base of the kind (5.55), where we
choose $A_{1,1}$ and $A_{2,1}$ to be “low voltage,” and $A_{3,1}$ and $A_{4,1}$ to be “high voltage.”
We choose $A_{1,2}$ and $A_{3,2}$ to be “slow speed,” while $A_{2,2}$ and $A_{4,2}$ are “fast speed.”
The membership function for “low voltage” is taken as $\mu_{A_{1,1}}(\varphi_1) = \kappa_1(0.5(\varphi_1 - 3))$,
with $\kappa_1$ defined by (5.51). The membership function for “high voltage” is taken as
$\mu_{A_{3,1}} = 1 - \mu_{A_{1,1}} = \kappa_1(-0.5(\varphi_1 - 3))$. The membership functions for slow and fast
speed are chosen analogously, with breaking points 8 and 15 rad/sec. See Figure 5.5.
The statements $B_j$ about the output are chosen to be “slow,” “medium,” and “fast,”
with membership functions that are triangular shaped like $\kappa_2$ in (5.52) with peaks at
5, 10, and 20 rad/sec. We thus obtain a rule base:


If $\varphi_1(t)$ is low and $\varphi_2(t)$ is slow then $y(t)$ is slow.
If $\varphi_1(t)$ is low and $\varphi_2(t)$ is fast then $y(t)$ is medium.
If $\varphi_1(t)$ is high and $\varphi_2(t)$ is slow then $y(t)$ is medium.
If $\varphi_1(t)$ is high and $\varphi_2(t)$ is fast then $y(t)$ is fast.


Figure 5.5 Membership functions for the DC motor.


Let us see what prediction of $y(t)$ the rule base gives, in case $\varphi(t) = [4, 17]^T$.
This means that the voltage is “low” with a membership degree of 0.25, while it is
“high” with a degree of 0.75. The past velocity $\varphi_2(t) = y(t-1)$ is “fast” to a degree
of 1 and “slow” to a degree of 0. The four rules in the rule base then are weighted
together according to (5.58) as

$$\hat{y}(t) = 5 \cdot (0.25 \cdot 0) + 10 \cdot (0.25 \cdot 1) + 10 \cdot (0.75 \cdot 0) + 20 \cdot (0.75 \cdot 1) = 17.5 \text{ rad/sec}$$
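The weighting in (5.58) can be checked numerically. The following is a minimal sketch, assuming the membership degrees quoted in the example (0.25/0.75 for the voltage, 0/1 for the speed) and the conclusion values 5, 10, 10, and 20; the function and variable names are illustrative and not part of the text.

```python
def fuzzy_prediction(conclusions, degrees):
    """Weighted mean (5.58): sum over rules of alpha_j times the product
    of the membership degrees of that rule's conditions, as in (5.57)."""
    total = 0.0
    for alpha, mus in zip(conclusions, degrees):
        applicability = 1.0
        for mu in mus:
            applicability *= mu
        total += alpha * applicability
    return total

# Membership degrees quoted in Example 5.4 (taken as given, not recomputed).
low, high = 0.25, 0.75
slow, fast = 0.0, 1.0

alphas = [5.0, 10.0, 10.0, 20.0]  # slow, medium, medium, fast
degrees = [(low, slow), (low, fast), (high, slow), (high, fast)]

y_hat = fuzzy_prediction(alphas, degrees)
weights_sum = sum(m1 * m2 for m1, m2 in degrees)  # equals 1 by (5.56)
print(y_hat, weights_sum)
```

Because the rule base is complete, the four applicabilities sum to one, so the result is a genuine weighted mean of the conclusions.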


Back to the General Black-Box Formulation 


Now, if some or all of the rules in the rule base need “tuning,” we may introduce
parameters to be tuned. The parameters could, e.g., be some or all of the (defuzzified)
“conclusions” $\alpha_j$ of the rules. They could also be parameters associated with
the membership functions $\mu_{A_{j,i}}(\varphi_i)$ for the regressors, typically scale and location
parameters as in (5.53).


This means that (5.58) takes the form

$$\hat{y} = g(\varphi, \theta) = \sum_{k=1}^{p} \alpha_k \, g_k(\varphi, \beta^k, \gamma^k) \tag{5.59}$$


where the “basis functions” $g_k$ are obtained from the parameterized membership
functions as

$$g_k(\varphi, \beta^k, \gamma^k) = \prod_{j=1}^{d} \mu_{A_{k,j}}\!\left(\beta_j^k(\varphi_j - \gamma_j^k)\right) \tag{5.60}$$


We are thus back to the basic situation of (5.32) and (5.34), where the expansion into
the $d$-dimensional regressor space is obtained by the tensor product construction
(5.39).

Normally, not all of the parameters $\alpha_k$, $\beta_j^k$, and $\gamma_j^k$ should be freely adjustable.
For example, the requirement (5.56) imposes certain constraints on the scale and
location parameters. If the fuzzy partition is fixed and not adjustable, i.e., $\beta$ and $\gamma$
fixed, then we get a particular case of the kernel estimate (5.49), which is also a linear
regression model.

Thus fuzzy models are just particular instances of the general model structure (5.32).

One of the major potential advantages is that the fuzzy rule base may give
functions $g_j$ that reflect (verbal) physical insight about the system. This may also be
useful for coming up with reasonable initial values of the dilation and location parameters
to be estimated. One should realize, though, that the knowledge encoded in a fuzzy
rule base may be nullified if many parameters are let loose. For example, in the
DC-motor rule base, if all the numerical values for the motor speed (5, 10, 10, and
20) are replaced by free parameters, which are then estimated, our knowledge about
what causes high speed has not been utilized by the resulting model structure.


5.7 FORMAL CHARACTERIZATION OF MODELS (*)


In this section we shall give a counterpart of the discussion of Section 4.5 for general,
possibly time-varying, and nonlinear models. We assume that the output is
$p$-dimensional and that the input is $m$-dimensional. $Z^t$ denotes, as before, the
input-output data up to and including time $t$.


Models 


A model $m$ of a dynamical system is a sequence of functions $g_m(t, Z^{t-1})$, $t = 1, 2, \ldots$,
from $\mathbf{R} \times \mathbf{R}^{p(t-1)} \times \mathbf{R}^{m(t-1)}$ to $\mathbf{R}^p$, representing a way of guessing or
predicting the output $y(t)$ from past data:

$$\hat{y}(t|t-1) = g_m(t, Z^{t-1}) \tag{5.61}$$

A model that defines only the predictor function is called a predictor model.

When (5.61) is complemented with the conditional (given $Z^{t-1}$) probability density
function (CPDF) of the associated prediction errors

$$f_e(x, t, Z^{t-1}): \text{ CPDF of } y(t) - \hat{y}(t|t-1), \text{ given } Z^{t-1} \tag{5.62}$$

we call the model a complete probabilistic model. A typical model assumption is
that the prediction errors are independent. Then $f_e$ does not depend on $Z^{t-1}$:

$$f_e(x, t): \text{ PDF of } y(t) - \hat{y}(t|t-1), \text{ these errors independent} \tag{5.63}$$

Sometimes one may prefer not to specify the complete PDF but only its second
moment (the covariance matrix):

$$\Lambda_m(t): \text{ covariance matrix of } y(t) - \hat{y}(t|t-1), \text{ these errors independent} \tag{5.64}$$

A model (5.61) together with (5.64) could be called a partial probabilistic model.

A model can further be classified according to the following properties.


1. The model $m$ is said to be linear if $g_m(t, Z^{t-1})$ is linear in $y^{t-1}$ and $u^{t-1}$:

$$g_m(t, Z^{t-1}) = W_t^y(q)\, y(t) + W_t^u(q)\, u(t) \tag{5.65}$$

2. A model $m$ is said to be time invariant if $g_m(t, Z^{t-1})$ is invariant under a shift
of absolute time. If $\Lambda_m$ or $f_e$ is specified, it is further required that they be
independent of $t$.

3. A model $m$ is said to be a $k$-step-ahead predictor if $g_m(t, Z^{t-1})$ is a function
of $y^{t-k}$, $u^{t-1}$ only.

4. A model $m$ is said to be a simulation model or an output error model if
$g_m(t, Z^{t-1})$ is a function of $u^{t-1}$ only.


Analogously to the linear case, we could define the stability of the predictor function 
and equality between different models [see (4.116)]. We refrain, however, from 
elaborating on these points here. 


Model Sets and Model Structures 
Sets of models $\mathcal{M}^*$ as well as model structures $\mathcal{M}$ as differentiable mappings

$$\mathcal{M}: \theta \to g_m(t, Z^{t-1}; \theta) \in \mathcal{M}^*; \quad \theta \in D_{\mathcal{M}} \subset \mathbf{R}^d \tag{5.66}$$

[and $\Lambda(t; \theta)$ or $f_e(x, t; \theta)$ if applicable] from subsets of $\mathbf{R}^d$ to model sets can be
defined analogously to Definition 4.3. Once equality between models has been defined,
identifiability concepts can be developed as in Section 4.5.

We shall say that a model structure $\mathcal{M}$ is a linear regression if $D_{\mathcal{M}} = \mathbf{R}^d$ and
the predictor function is a linear (or affine) function of $\theta$:

$$g(t, Z^{t-1}; \theta) = \varphi^T(t, Z^{t-1})\,\theta + \mu(t, Z^{t-1}) \tag{5.67}$$


Another View of Models (*) 


The definition of models as predictors is a rather pragmatic way of approaching the 
model concept. A more abstract line of thought can be developed as follows. 

As users, we communicate with the system only through the input-output data
sequences $Z^t = (y^t, u^t)$. Therefore, any assumption about the properties of the
system will be an assumption about $Z^t$. We could thus say that

A model of a system is an assumed relationship for $Z^t$, $t = 1, 2, \ldots$ (5.68)

Often, experiments on a system are not exactly reproducible. For a given input
sequence $u^t$, we may obtain different output sequences $y^t$ at different experiments
due to the presence of various disturbances. In such cases it is natural to regard $y^t$ as
a random variable of which we observe different realizations. A model of the system
would then be a description of the probabilistic properties of $Z^t$ (or, perhaps, of $y^t$,
given $u^t$). This model $m$ could be formulated in terms of a probability measure $P_m$
or the probability density function (PDF) for $Z^t$:

$$\bar{f}_m(t, Z^t) \tag{5.69}$$

That is,

$$P_m(Z^t \in B) = \int_{x^t \in B} \bar{f}_m(t, x^t)\, dx^t \tag{5.70}$$

Sometimes it is preferable to consider the input $u^t$ as a given deterministic sequence
and focus attention on the conditional PDF of $y^t$, given $u^t$:

$$\bar{f}_m(t, y^t | u^t) \tag{5.71}$$


A model (5.69) or (5.71) would normally be quite awkward to construct and
work with, and other, indirect ways of forming $\bar{f}_m$ will be preferred. Indeed, the
stochastic models of Sections 4.2 and 4.3 are implicit descriptions of the probability
density function for the measured signals. The introduction of unmeasurable,
stochastic disturbances $\{v(t)\}$, $\{e(t)\}$, and so on, is a convenient way of describing
the probabilistic properties of the observed signal, and also often corresponds to an
intuitive feeling for how the output is generated. It is, however, worth noting that
the effect of these unmeasurable disturbances in the model is just to define the PDF
for the observed signals.

The assumed PDF $\bar{f}_m$ in (5.69) is in a sense the most general model that can
be applied for an observed data record $y^t$, $u^t$. It includes deterministic models as a
special case. It also corresponds to a general statistical problem: how to describe the
properties of an observed data vector. For our current purposes, it is, however, not
a suitably structured model. The natural direction of time flow in the data record, as
well as the notion of causality, is not present in (5.69).

Given $\bar{f}_m(t, Z^t)$ in (5.69), it is, at least conceptually, possible to compute the
conditional mean of $y(t)$ given $y^{t-1}$, $u^{t-1}$; that is,

$$\hat{y}(t|t-1) = E_m\left[y(t) \mid y^{t-1}, u^{t-1}\right] = g_m(t, Z^{t-1}) \tag{5.72}$$

and the distribution of $y(t) - g_m(t, Z^{t-1})$, say $f_e(x, t, Z^{t-1})$. From (5.69) we can
thus compute a model (5.61) along with a CPDF $f_e$ in (5.62). Conversely, given
the predictor function $g_m(t, Z^{t-1})$ and an assumed PDF $f_e(x, t)$ for the associated
prediction errors, we can calculate the joint PDF for the data $y^t$, $u^t$ as in (5.69). This
follows from the following lemma:


Lemma 5.1. Suppose that $u^t$ is a given, deterministic sequence, and assume that
the generation of $y^t$ is described by the model

$$y(t) = g_m(t, Z^{t-1}) + \varepsilon_m(t) \tag{5.73}$$

where the conditional PDF of $\varepsilon_m(t)$ (given $y^{t-1}$, $u^{t-1}$) is $f_e(x, t)$. Then the joint
probability density function for $y^t$, given $u^t$, is

$$\bar{f}_m(t, y^t | u^t) = \prod_{k=1}^{t} f_e\left(y(k) - g_m(k, Z^{k-1}),\, k\right) \tag{5.74}$$

Here we have, for convenience, denoted the dummy variable $x_k$ for the distribution
of $y(k)$ by $y(k)$ itself.

Proof. The output $y(t)$ is generated by (5.73). Hence the CPDF of $y(t)$, given
$Z^{t-1}$, is

$$p(x_t | Z^{t-1}) = f_e\left(x_t - g_m(t, Z^{t-1}),\, t\right) \tag{5.75}$$

Using Bayes’s rule (1.10), the joint CPDF of $y(t)$ and $y(t-1)$, given $Z^{t-2}$, can be
expressed as

$$\begin{aligned}
p(x_t, x_{t-1} | Z^{t-2}) &= p\left(x_t \mid y(t-1) = x_{t-1}, Z^{t-2}\right) \cdot p\left(x_{t-1} | Z^{t-2}\right) \\
&= f_e\left(x_t - g_m(t, Z^{t-1}),\, t\right) \cdot f_e\left(x_{t-1} - g_m(t-1, Z^{t-2}),\, t-1\right)
\end{aligned}$$

where $y(t-1)$ in $g_m(t, Z^{t-1})$ should be replaced by $x_{t-1}$. Here we have assumed
$u^t$ to be a given deterministic sequence. Iterating the preceding expression to $t = 1$
gives the joint probability density function of $y(t), y(t-1), \ldots, y(1)$, given $u^t$, that
is, the function $\bar{f}_m$ in (5.74). ∎
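The product form (5.74) is what makes likelihood computations practical: the joint density is accumulated one prediction error at a time. The following is a minimal sketch, assuming a time-invariant Gaussian error density and a hypothetical first-order predictor; neither choice comes from the text, and all names are illustrative.

```python
import math

def gaussian_pdf(x, variance):
    """Assumed prediction-error density f_e(x); taken independent of t here."""
    return math.exp(-0.5 * x * x / variance) / math.sqrt(2.0 * math.pi * variance)

def joint_log_density(y, u, predictor, variance):
    """Logarithm of (5.74): sum over k of log f_e(y(k) - g_m(k, Z^{k-1}))."""
    log_f = 0.0
    for k in range(len(y)):
        y_hat = predictor(y[:k], u[:k])  # uses only data before time k
        log_f += math.log(gaussian_pdf(y[k] - y_hat, variance))
    return log_f

def predictor(past_y, past_u):
    # Hypothetical predictor g = 0.8*y(t-1) + 0.5*u(t-1); zero before any data.
    if not past_y:
        return 0.0
    return 0.8 * past_y[-1] + 0.5 * past_u[-1]

u = [1.0, 0.0, -1.0]
y = [0.2, 0.7, 0.4]
log_f = joint_log_density(y, u, predictor, variance=0.25)

# Direct product of the three error densities, for comparison.
errors = [0.2 - 0.0, 0.7 - (0.8 * 0.2 + 0.5 * 1.0), 0.4 - (0.8 * 0.7 + 0.5 * 0.0)]
direct = 1.0
for e in errors:
    direct *= gaussian_pdf(e, 0.25)
print(math.exp(log_f), direct)
```

Working with the logarithm avoids numerical underflow when the data record is long.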


The important conclusion from this discussion is that the predictor model (5.61),
complemented with an assumed PDF for the associated prediction errors, is no more
and no less general than the general, unstructured joint PDF model (5.69).


Remark. Notice the slight difference, though, in the conditional PDF for the
prediction errors. The general form (5.69) may in general lead to a conditional
PDF that in fact depends on $Z^{t-1}$: $f_e(x, t, Z^{t-1})$ as in (5.62). This means that the
prediction errors are not necessarily independent, while they do form a martingale
difference sequence:

$$E\left[\varepsilon_m(t) \mid \varepsilon_m(t-1), \ldots, \varepsilon_m(1)\right] = 0 \tag{5.76}$$

In the predictor formulation (5.73), we assumed the CPDF $f_e(x, t)$ not to depend
on $Z^{t-1}$, which is an implied assumption of independence of $\varepsilon_m(t)$ on previous data.
Clearly, though, we could have relaxed that assumption with obvious modifications
in (5.74) as a result.


5.8 SUMMARY 


The development of models for nonlinear systems is much like that of linear sys- 
tems. The basic difference from a formal point of view is that the predictor becomes 
a nonlinear function of past observations. The important difference from a prac- 
tical point of view is that the potential richness of possibilities makes unstructured 
black-box-type models much more demanding than in the linear case. It is much 
more important that knowledge about the character of the nonlinearities is built into
the model structures. This can be done in different ways. An ambitious physical
modeling attempt may lead to a well structured state-space model with unknown 
physical parameters as in (5.26). More leisurely “semi-physical” modeling may lead 
to valuable insights into how the regressors in (5.13) should be formed. Note that the 
physical insight need not be in analytic form. The nonlinearities could very well be 
defined in look-up tables, and the model parameters could be entries in these tables. 

When physical insight is lacking, it remains to resort to black-box structures,
as described in Sections 5.4 to 5.6. These include approaches such as artificial neural
networks, kernel methods, and fuzzy models.

We have also given a short summary, in Section 5.7, of formal aspects of dynamical
systems. We have stressed that a model in the first place is a predictor function
from past observations to the future output. The predictor function may possibly be
complemented with a model assumption of properties of the associated prediction
errors, such as their variance or their PDF.

A general discussion of the model concept is given by Willems (1985).


5.10 PROBLEMS 


5G.1 Consider the following nonlinear structure:

$$x(t) = f\left(x(t-1), \ldots, x(t-n), u(t-1), \ldots, u(t-n); \theta\right) \tag{5.77a}$$
$$y(t) = x(t) + v(t) \tag{5.77b}$$
$$v(t) = H(q, \theta)e(t) \tag{5.77c}$$

Here (5.77a) describes the nonlinear noise-free dynamics, parametrized by $\theta$, while
(5.77b) describes the measurements as the noise-free output, corrupted by the noise
$\{v(t)\}$, which is modeled in the general way (2.19). Show that the natural predictor for
(5.77) is given by

$$\hat{y}(t|\theta) = \left[1 - H^{-1}(q, \theta)\right] y(t) + H^{-1}(q, \theta)\, x(t, \theta)$$

where $x(t, \theta)$ is defined by

$$x(t, \theta) = f\left(x(t-1, \theta), \ldots, x(t-n, \theta), u(t-1), \ldots, u(t-n); \theta\right)$$


5G.2 To investigate how the smoothness of a function $f(x)$ and the number of arguments,
$\dim x = d$, affect how many basis functions are required to approximate it, reason as follows:
If $f(x)$ is $p$ times differentiable with a bounded $p$th derivative, we can expand it in
Taylor series

$$f(x) = f(x_0) + (x - x_0)^T f'(x_0) + \ldots + \frac{1}{p!}(x - x_0)^p f^{(p)}(\xi)$$

where the $k$th derivative $f^{(k)}$ is a tensor with $d^k$ elements, and the multiplication with
$(x - x_0)^k$ has to be interpreted accordingly. Suppose that we seek an approximation of
$f$ over the unit sphere $|x| \le 1$ with an error less than $\delta$. This is to be accomplished
by local approximations around centerpoints $x_i$ of the radial basis type. Show that
the necessary number of such centerpoints is given by (5.46), and that the number of
parameters associated with each centerpoint is $\sim d^p$.


5E.1 Consider the bilinear model structure described by

$$x(t) + a_1 x(t-1) + a_2 x(t-2) = b_1 u(t-1) + b_2 u(t-2) + c_1 x(t-1)u(t-1)$$
$$y(t) = x(t) + v(t)$$

where

$$\theta = [a_1 \;\; a_2 \;\; b_1 \;\; b_2 \;\; c_1]^T$$

(a) Assume $\{v(t)\}$ to be white noise and compute the predictor $\hat{y}(t|\theta)$ and give an
expression for it in the pseudolinear regression form

$$\hat{y}(t|\theta) = \varphi^T(t, \theta)\,\theta$$

with a suitable vector $\varphi(t, \theta)$.

(b) Now suppose that $\{v(t)\}$ is not white, but can be modeled as an (unknown) first-order
ARMA process. Then suggest a suitable predictor for the system.


5E.2 Consider the system in Figure 5.1 (upper plot), where the nonlinearity is saturation with
an unknown gain:

$$f(u(t)) = \begin{cases} \theta_1 \cdot \theta_2, & \text{if } u(t) > \theta_2 \\ \theta_1 u(t), & \text{if } |u(t)| \le \theta_2 \\ -\theta_1 \cdot \theta_2, & \text{if } u(t) < -\theta_2 \end{cases}$$

Suppose that the linear system can be described by a second-order ARX model. Write
down, explicitly, the predictor for this model, parametrized in $\theta_1$, $\theta_2$, and the ARX
parameters.

5E.3 Show (5.56) by induction as follows: Suppose first that there is just one regressor,
associated with $k$ attributes. Then there must be $k$ rules in the rule base, for it to be
complete, covering all the attributes. Thus (5.56) follows from the assumption (5.54).
Now suppose that (5.56) holds for $d$ regressors and that there are $K$ rules. Here
$K = k_1 \cdot k_2 \cdots k_d$, with $k_j$ as the number of attributes for regressor $j$. Now, add
another regressor $\varphi_{d+1}$ with $k_{d+1}$ attributes, subject to (5.54). For the rule base to
remain complete, it must now be complemented with $K \cdot k_{d+1}$ new rules, covering the
combinations of the previous cases with each attribute of the new regressor. Show the
induction step, that (5.56) holds also when $\varphi_{d+1}$ has been added.

5T.1 Time-continuous bilinear system descriptions are common in many fields (see Mohler,
1973). A model can be written

$$\dot{x}(t) = A(\theta)x(t) + B(\theta)u(t) + G(\theta)x(t)u(t) + w(t) \tag{5.78a}$$

where $x(t)$ is the state vector, $w(t)$ is white Gaussian noise with variance matrix $R_1$,
and $u(t)$ is a scalar input. The output of the system is sampled as

$$y(t) = C(\theta)x(t) + e(t), \quad \text{for } t = kT \tag{5.78b}$$

where $e(t)$ is white Gaussian measurement noise with variance $R_2$. The input is
piecewise constant:

$$u(t) = u_k, \quad kT \le t < (k+1)T$$

Derive an expression for the prediction of $y((k+1)T)$, given $u_r$ and $y(rT)$ for $r \le k$,
based on the model (5.78).


5T.2 Consider the Monod growth model structure

$$\dot{x}_1 = \frac{\theta_1 \cdot x_2}{\theta_2 + x_2}\, x_1 - \alpha_1 x_1$$
$$\dot{x}_2 = -\frac{1}{\theta_3} \cdot \frac{\theta_1 \cdot x_2}{\theta_2 + x_2}\, x_1 - \alpha_1 (x_2 - \alpha_2)$$

Here $y = [x_1 \;\; x_2]^T$ is measured and $\alpha_1$ and $\alpha_2$ are known constants. Discuss whether the
parameters $\theta_1$, $\theta_2$, and $\theta_3$ are identifiable.


Remark: Although we did not give any formal definition of identifiability for
nonlinear model structures, they are quite analogous to the definitions in Sections 4.5
and 4.6. Thus, test whether two different parameter values can give the same
input-output behavior of the model.

[See Holmberg and Ranta (1982). $x_1$ here is the concentration of the biomass
that is growing, while $x_2$ is the concentration of the growth-limiting substrate. $\theta_1$ is
the maximum growth rate, $\theta_2$ is the Michaelis-Menten constant, and $\theta_3$ is the yield
coefficient.]


II METHODS

NONPARAMETRIC TIME- AND FREQUENCY-DOMAIN METHODS


A linear time-invariant model can be described by its transfer functions or by the 
corresponding impulse responses, as we found in Chapter 4. In this chapter we shall 
discuss methods that aim at determining these functions by direct techniques without 
first selecting a confined set of possible models. Such methods are often also called 
nonparametric since they do not (explicitly) employ a finite-dimensional parameter 
vector in the search for a best description. We shall discuss the determination of the 
transfer function G(q) from input to output. Section 6.1 deals with time-domain 
methods for this, and Sections 6.2 to 6.4 describe frequency-domain techniques of 
various degrees of sophistication. The determination of H (q) or the disturbance 
spectrum is discussed in Section 6.5. 

It should be noted that throughout this chapter we assume the system to operate
in open loop [i.e., $\{u(t)\}$ and $\{v(t)\}$ are independent]. Closed-loop configurations
will typically lead to problems for nonparametric methods, as outlined in some of
the problems. These issues are discussed in more detail in Chapter 13.


6.1 TRANSIENT-RESPONSE ANALYSIS AND CORRELATION ANALYSIS

Impulse-Response Analysis

If a system that is described by (2.8)

$$y(t) = G_0(q)u(t) + v(t) \tag{6.1}$$

is subjected to a pulse input

$$u(t) = \begin{cases} \alpha, & t = 0 \\ 0, & t \ne 0 \end{cases} \tag{6.2}$$

then the output will be

$$y(t) = \alpha g_0(t) + v(t) \tag{6.3}$$

by definition of $G_0$ and the impulse response $\{g_0(t)\}$. If the noise level is low, it is thus
possible to determine the impulse-response coefficients $\{g_0(t)\}$ from an experiment
with a pulse input. The estimates will be

$$\hat{g}(t) = \frac{y(t)}{\alpha} \tag{6.4}$$

and the errors $v(t)/\alpha$. This simple idea is impulse-response analysis. Its basic weakness
is that many physical processes do not allow pulse inputs of such an amplitude
that the error $v(t)/\alpha$ is insignificant compared to the impulse-response coefficients.
Moreover, such an input could make the system exhibit nonlinear effects that would
disturb the linearized behavior we have set out to model.
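To see how the pulse amplitude scales the estimation error, the following is a minimal simulation sketch. The impulse response, the noise level, and the amplitude are illustrative assumptions, not values from the text.

```python
import random

def simulate(g, u, noise_std, rng):
    """Simulate y(t) = sum_{k>=1} g[k] * u(t - k) + v(t); g[0] is unused."""
    y = []
    for t in range(len(u)):
        clean = sum(g[k] * u[t - k] for k in range(1, min(t, len(g) - 1) + 1))
        y.append(clean + rng.gauss(0.0, noise_std))
    return y

rng = random.Random(0)                 # fixed seed so the run is repeatable
g_true = [0.0, 1.0, 0.5, 0.25, 0.125]  # assumed impulse response g_0(0..4)
alpha = 10.0                           # pulse amplitude
u = [alpha] + [0.0] * 7                # pulse input (6.2)
y = simulate(g_true, u, noise_std=0.01, rng=rng)

g_hat = [y_t / alpha for y_t in y]     # estimate (6.4)
errors = [abs(g_hat[t] - g_true[t]) for t in range(1, 5)]
print(g_hat[1:5], max(errors))
```

Each error equals $|v(t)|/\alpha$, so a tenfold larger pulse reduces the error tenfold, provided the system remains in its linear range.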


Step-Response Analysis 


Similarly, a step

$$u(t) = \begin{cases} \alpha, & t \ge 0 \\ 0, & t < 0 \end{cases}$$

applied to (6.1) gives the output

$$y(t) = \alpha \sum_{k=1}^{t} g_0(k) + v(t) \tag{6.5}$$

From this, estimates of $g_0(k)$ could be obtained as

$$\hat{g}(t) = \frac{y(t) - y(t-1)}{\alpha} \tag{6.6}$$

which has an error $[v(t) - v(t-1)]/\alpha$. If we really aim at determining the
impulse-response coefficients using (6.6), we would suffer from large errors in most practical
applications. However, if the goal is to determine some basic control-related
characteristics, such as delay time, static gain, and dominating time constants [i.e., the model
(4.50)], step responses (6.5) can very well furnish that information to a sufficient
degree of accuracy. In fact, well-known rules for tuning simple regulators such as the
Ziegler-Nichols rule (Ziegler and Nichols, 1942) are based on model information
reached in step responses.

Based on plots of the step response, some characteristic numbers can be graphically
constructed, which in turn can be used to determine parameters in a model of
given order. We refer to Rake (1980) for a discussion of such characteristics.
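The differencing in (6.6) and the static gain read from a step response can be illustrated in the noise-free case. The following is a small sketch with an assumed impulse response; all names and values are illustrative.

```python
def step_response(g, alpha, n):
    """Noise-free output (6.5): y(t) = alpha * sum_{k=1}^{t} g[k]; g[0] unused."""
    return [alpha * sum(g[1:min(t, len(g) - 1) + 1]) for t in range(n)]

def impulse_from_step(y, alpha):
    """Estimate (6.6): g_hat(t) = [y(t) - y(t-1)] / alpha for t >= 1."""
    return [(y[t] - y[t - 1]) / alpha for t in range(1, len(y))]

g_true = [0.0, 1.0, 0.5, 0.25]   # assumed impulse response
alpha = 2.0                      # step amplitude
y = step_response(g_true, alpha, 6)

g_hat = impulse_from_step(y, alpha)   # estimates for t = 1, ..., 5
static_gain = y[-1] / alpha           # settled output divided by step size
print(g_hat, static_gain)
```

With noise present, each difference carries the error $[v(t) - v(t-1)]/\alpha$, whereas the static gain is read from the settled level and is far less sensitive.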


Correlation Analysis 


Consider the model description (6.1):

$$y(t) = \sum_{k=1}^{\infty} g_0(k)u(t-k) + v(t) \tag{6.7}$$

If the input is a quasi-stationary sequence [see (2.59)] with

$$\bar{E}u(t)u(t-\tau) = R_u(\tau)$$

and

$$\bar{E}u(t)v(t-\tau) = 0 \quad \text{(open-loop operation)}$$

then according to Theorem 2.2 (expressed in the time domain)

$$\bar{E}y(t)u(t-\tau) = R_{yu}(\tau) = \sum_{k=1}^{\infty} g_0(k)R_u(k-\tau) \tag{6.8}$$

If the input is chosen as white noise so that

$$R_u(\tau) = \alpha\delta_{\tau 0}$$

then

$$g_0(\tau) = \frac{R_{yu}(\tau)}{\alpha}$$

An estimate of the impulse response is thus obtained from an estimate of $R_{yu}(\tau)$;
for example,

$$\hat{R}_{yu}^N(\tau) = \frac{1}{N}\sum_{t=\tau}^{N} y(t)u(t-\tau) \tag{6.9}$$

If the input is not white noise, we may estimate

$$\hat{R}_u^N(\tau) = \frac{1}{N}\sum_{t=\tau}^{N} u(t)u(t-\tau) \tag{6.10}$$

and solve

$$\hat{R}_{yu}^N(\tau) = \sum_{k=1}^{M} \hat{g}(k)\hat{R}_u^N(k-\tau) \tag{6.11}$$

for $\hat{g}(k)$. If the input is open for manipulation, it is of course desirable to choose
it so that (6.10) and (6.11) become easy to solve. Equipment for generating such
signals and solving for $\hat{g}(k)$ is commercially available. See Godfrey (1980) for a
more detailed treatment.

In fact, the most natural way to estimate $g(k)$ when the input is not “exactly
white” is to truncate (6.7) at $n$, and treat it as an $n$th-order FIR model (4.46) with
the parametric (least-squares) methods of Chapter 7. Another way is to filter both
inputs and outputs by a prefilter that makes the input as white as possible (“input
prewhitening”) and then compute the correlation function (6.9) for these filtered
sequences.
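For the white-input case, correlation analysis reduces to the sample cross-covariance (6.9) divided by the input variance. The following is a minimal sketch, assuming a random binary input (variance 1), an illustrative impulse response, and Gaussian output noise; none of these values come from the text.

```python
import random

rng = random.Random(1)                  # fixed seed so the run is repeatable
N = 20000
g_true = [0.0, 1.0, 0.5, 0.25, 0.125]   # assumed impulse response g_0(0..4)

# Random binary input: white, with R_u(0) = alpha = 1.
u = [rng.choice((-1.0, 1.0)) for _ in range(N)]

y = []
for t in range(N):
    clean = 0.0
    for k in range(1, len(g_true)):
        if t - k >= 0:
            clean += g_true[k] * u[t - k]
    y.append(clean + rng.gauss(0.0, 0.1))   # additive disturbance v(t)

def cross_covariance(y, u, tau):
    """Sample estimate (6.9): (1/N) * sum over t >= tau of y(t) * u(t - tau)."""
    n = len(y)
    return sum(y[t] * u[t - tau] for t in range(tau, n)) / n

alpha = 1.0
g_hat = [cross_covariance(y, u, tau) / alpha for tau in range(1, 5)]
print(g_hat)
```

The estimates approach the true coefficients at a rate of roughly $1/\sqrt{N}$, which is why long records are needed even at moderate noise levels.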


6.2 FREQUENCY-RESPONSE ANALYSIS 


Sine-wave Testing 


The fundamental physical interpretation of the transfer function $G(z)$ is that the
complex number $G(e^{i\omega})$ bears information about what happens to an input sinusoid
[see (2.32) to (2.34)]. We thus have for (6.1) that with

$$u(t) = \alpha\cos\omega t, \quad t = 0, 1, 2, \ldots \tag{6.12}$$

then

$$y(t) = \alpha\left|G_0(e^{i\omega})\right|\cos(\omega t + \varphi) + v(t) + \text{transient} \tag{6.13}$$

where

$$\varphi = \arg G_0(e^{i\omega}) \tag{6.14}$$

This property also gives a clue to a simple way of determining $G_0(e^{i\omega})$:

With the input (6.12), determine the amplitude and the phase shift of the
resulting output cosine signal, and calculate an estimate $\hat{G}_N(e^{i\omega})$ based on that
information. Repeat for a number of frequencies in the interesting frequency
band.

This is known as frequency analysis and is a simple method for obtaining detailed
information about a linear system.


Frequency Analysis by the Correlation Method 


With the noise component $v(t)$ present in (6.13), it may be cumbersome to determine
$|G_0(e^{i\omega})|$ and $\varphi$ accurately by graphical methods. Since the interesting component
of $y(t)$ is a cosine function of known frequency, it is possible to correlate it out from
the noise in the following way. Form the sums

$$I_C(N) = \frac{1}{N}\sum_{t=1}^{N} y(t)\cos\omega t, \qquad I_S(N) = \frac{1}{N}\sum_{t=1}^{N} y(t)\sin\omega t \tag{6.15}$$

Inserting (6.13) into (6.15), ignoring the transient term, gives

$$\begin{aligned}
I_C(N) &= \frac{1}{N}\sum_{t=1}^{N} \alpha\left|G_0(e^{i\omega})\right|\cos(\omega t + \varphi)\cos\omega t + \frac{1}{N}\sum_{t=1}^{N} v(t)\cos\omega t \\
&= \alpha\left|G_0(e^{i\omega})\right|\frac{1}{2}\frac{1}{N}\sum_{t=1}^{N}\left[\cos\varphi + \cos(2\omega t + \varphi)\right] + \frac{1}{N}\sum_{t=1}^{N} v(t)\cos\omega t \\
&= \frac{\alpha}{2}\left|G_0(e^{i\omega})\right|\cos\varphi + \alpha\left|G_0(e^{i\omega})\right|\frac{1}{2}\frac{1}{N}\sum_{t=1}^{N}\cos(2\omega t + \varphi) + \frac{1}{N}\sum_{t=1}^{N} v(t)\cos\omega t
\end{aligned} \tag{6.16}$$

The second term tends to zero as $N$ tends to infinity, and so does the third term if $v(t)$
does not contain a pure periodic component of frequency $\omega$. If $\{v(t)\}$ is a stationary
stochastic process such that

$$\sum_{\tau=0}^{\infty} \tau\left|R_v(\tau)\right| < \infty$$

then the variance of the third term of (6.16) decays like $1/N$ (Problem 6T.2). Similarly,

$$I_S(N) = -\frac{\alpha}{2}\left|G_0(e^{i\omega})\right|\sin\varphi + \alpha\left|G_0(e^{i\omega})\right|\frac{1}{2}\frac{1}{N}\sum_{t=1}^{N}\sin(2\omega t + \varphi) + \frac{1}{N}\sum_{t=1}^{N} v(t)\sin\omega t \tag{6.17}$$

These two expressions suggest the following estimates of $|G_0(e^{i\omega})|$ and $\varphi$:

$$\left|\hat{G}_N(e^{i\omega})\right| = \frac{\sqrt{I_C^2(N) + I_S^2(N)}}{\alpha/2} \tag{6.18a}$$

$$\hat{\varphi}_N = \arg\hat{G}_N(e^{i\omega}) = -\arctan\frac{I_S(N)}{I_C(N)} \tag{6.18b}$$

Rake (1980) gives a more detailed account of this method. By repeating the
procedure for a number of frequencies, a good picture of $G_0(e^{i\omega})$ over the frequency
domain of interest can be obtained. Equipment that performs such frequency
analysis by the correlation method is commercially available.

An advantage of this method is that a Bode plot of the system can be obtained
easily and that one may concentrate the effort on the interesting frequency ranges.
The main disadvantage is that many industrial processes do not admit sinusoidal
inputs in normal operation. The experiment must also be repeated for a number of
frequencies, which may lead to long experimentation periods.
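The correlation sums (6.15) and the estimates (6.18) can be tried out on a simulated system. The following is a minimal noise-free sketch, assuming a hypothetical first-order system $y(t) = a\,y(t-1) + b\,u(t-1)$ and a frequency on the grid $2\pi r/N$, so that the double-frequency term in (6.16) sums to zero exactly. The use of `atan2` instead of a plain arctangent is a choice made here to resolve the quadrant.

```python
import cmath
import math

a, b = 0.5, 1.0          # hypothetical system y(t) = a*y(t-1) + b*u(t-1)
alpha = 2.0              # input amplitude
N, r, skip = 200, 10, 200
omega = 2.0 * math.pi * r / N

total = skip + N
u = [alpha * math.cos(omega * t) for t in range(total)]
y = [0.0] * total
for t in range(1, total):
    y[t] = a * y[t - 1] + b * u[t - 1]

window = range(skip, total)   # last N samples, after the transient has died out
I_c = sum(y[t] * math.cos(omega * t) for t in window) / N
I_s = sum(y[t] * math.sin(omega * t) for t in window) / N

gain_hat = math.hypot(I_c, I_s) / (alpha / 2.0)   # (6.18a)
phase_hat = -math.atan2(I_s, I_c)                 # (6.18b), quadrant-aware

# True frequency response of the assumed system, for comparison.
G_true = b * cmath.exp(-1j * omega) / (1.0 - a * cmath.exp(-1j * omega))
print(gain_hat, abs(G_true), phase_hat, cmath.phase(G_true))
```

With a disturbance added to `y`, the same sums average it out, with a variance that decays like $1/N$ under the condition stated above.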


Relationship to Fourier Analysis 


Comparing (6.15) to the definition (2.37),

$$Y_N(\omega) = \frac{1}{\sqrt{N}}\sum_{t=1}^{N} y(t)e^{-i\omega t} \tag{6.19}$$

shows that

$$I_C(N) - iI_S(N) = \frac{1}{\sqrt{N}}Y_N(\omega) \tag{6.20}$$

As in (2.46) we find that, for (6.12),

$$U_N(\omega) = \frac{\sqrt{N}\alpha}{2}, \quad \text{if } \omega = \frac{2\pi r}{N} \text{ for some integer } r \tag{6.21}$$

It is straightforward to rearrange (6.18) as

$$\hat{G}_N(e^{i\omega}) = \frac{\sqrt{N}\,Y_N(\omega)}{N\alpha/2} \tag{6.22}$$

which, using (6.21), means that

$$\hat{G}_N(e^{i\omega}) = \frac{Y_N(\omega)}{U_N(\omega)} \tag{6.23}$$

Here $\omega$ is precisely the frequency of the input signal. Comparing with (2.53), we also
find (6.23) a most reasonable estimate (especially since $R_N(\omega)$ in (2.53) is zero for
periodic inputs, according to the corollary of Theorem 2.1).


6.3 FOURIER ANALYSIS 


Empirical Transfer-function Estimate 


We found the expression (6.23) to correspond to frequency analysis with a single
sinusoid of frequency $\omega$ as input. In a linear system, different frequencies pass
through the system independently of each other. It is therefore quite natural to
extend the frequency analysis estimate (6.23) also to the case of multifrequency
inputs. That is, we introduce the following estimate of the transfer function:

$$\hat{\hat{G}}_N(e^{i\omega}) = \frac{Y_N(\omega)}{U_N(\omega)} \tag{6.24}$$

with $Y_N$ and $U_N$ defined by (6.19), also for the case where the input is not a single
sinusoid. This estimate is also quite natural in view of Theorem 2.1.

We shall call $\hat{\hat{G}}_N(e^{i\omega})$ the empirical transfer-function estimate (ETFE), for
reasons that we shall discuss shortly. In (6.24) we assume of course that $U_N(\omega) \ne 0$. If
this does not hold for some frequencies, we simply regard the ETFE as undefined at
those frequencies. We call this estimate empirical, since no other assumptions have
been imposed than linearity of the system. In the case of multifrequency inputs, the
ETFE consists of $N/2$ essential points. [Recall that estimates at frequencies intermediate
to the grid $\omega = 2\pi k/N$, $k = 0, 1, \ldots, N-1$, are obtained by trigonometrical
interpolation in (2.37).] Also, since $y$ and $u$ are real, we have

$$\hat{\hat{G}}_N(e^{2\pi i k/N}) = \overline{\hat{\hat{G}}_N(e^{2\pi i (N-k)/N})} \tag{6.25}$$

[compare (2.40) and (2.41)].

The original data sequence consisting of $2N$ numbers $y(t)$, $u(t)$, $t = 1, 2, \ldots, N$,
has thus been condensed into the $N$ numbers

$$\operatorname{Re}\hat{\hat{G}}_N(e^{2\pi i k/N}), \quad \operatorname{Im}\hat{\hat{G}}_N(e^{2\pi i k/N}), \quad k = 0, 1, \ldots, \frac{N}{2} - 1$$

This is quite a modest data reduction, revealing that most of the information
contained in the original data $y$, $u$ still is quite “raw.”

In addition to an extension of frequency analysis, the ETFE can be interpreted
as a way of (approximately) solving the set of convolution equations

$$y(t) = \sum_{k=1}^{N} g_0(k)u(t-k), \quad t = 1, 2, \ldots, N \tag{6.26}$$

for $g_0(k)$, $k = 1, 2, \ldots, N$, using Fourier techniques.
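The ETFE is simply a ratio of two discrete Fourier transforms. The following is a minimal sketch, assuming a hypothetical first-order system driven in steady state by a two-tone input that is periodic over the record. It also checks the conjugate symmetry (6.25) and leaves the estimate undefined where $U_N(\omega)$ vanishes. The tolerance `tol` and all names are illustrative choices.

```python
import cmath
import math

def dft(x, omega):
    """(6.19): (1/sqrt(N)) * sum_{t=1}^{N} x(t) * exp(-i*omega*t)."""
    n = len(x)
    return sum(x[t - 1] * cmath.exp(-1j * omega * t) for t in range(1, n + 1)) / math.sqrt(n)

def etfe(y, u, omega, tol=1e-9):
    """(6.24): Y_N / U_N, or None where U_N vanishes and the ETFE is undefined."""
    u_n = dft(u, omega)
    if abs(u_n) < tol:
        return None
    return dft(y, omega) / u_n

def grid(k, n):
    """Grid frequency 2*pi*k/N."""
    return 2.0 * math.pi * k / n

a, b, N = 0.5, 1.0, 64   # hypothetical system y(t) = a*y(t-1) + b*u(t-1)
total = 4 * N
u_all = [math.cos(grid(3, N) * t) + 0.5 * math.cos(grid(7, N) * t) for t in range(total)]
y_all = [0.0] * total
for t in range(1, total):
    y_all[t] = a * y_all[t - 1] + b * u_all[t - 1]
u, y = u_all[-N:], y_all[-N:]   # one full period, transient discarded

def true_response(omega):
    return b * cmath.exp(-1j * omega) / (1.0 - a * cmath.exp(-1j * omega))

g3 = etfe(y, u, grid(3, N))     # excited frequency
g61 = etfe(y, u, grid(61, N))   # mirror frequency N - 3
g5 = etfe(y, u, grid(5, N))     # not excited by the input
print(g3, true_response(grid(3, N)), g61, g5)
```

In this noise-free periodic case the estimate coincides with the true frequency response at the excited frequencies, which matches the remark that $R_N(\omega)$ is zero for periodic inputs.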


Properties of the ETFE 


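Assume that the system is subject to (6.1). Introducing

$$V_N(\omega) = \frac{1}{\sqrt{N}}\sum_{t=1}^{N} v(t)e^{-i\omega t} \tag{6.27}$$

for the disturbance term, we find from Theorem 2.1 that

$$\hat{\hat{G}}_N(e^{i\omega}) = G_0(e^{i\omega}) + \frac{R_N(\omega)}{U_N(\omega)} + \frac{V_N(\omega)}{U_N(\omega)} \tag{6.28}$$

where the term $R_N(\omega)$ is subject to (2.54) and decays as $1/\sqrt{N}$.

Let us now investigate the influence of the term $V_N(\omega)$ on $\hat{\hat{G}}_N(e^{i\omega})$. Since $v(t)$
is assumed to have zero mean value,

$$EV_N(\omega) = 0, \quad \forall\omega$$

so that

$$E\hat{\hat{G}}_N(e^{i\omega}) = G_0(e^{i\omega}) + \frac{R_N(\omega)}{U_N(\omega)} \tag{6.29}$$

Here expectation is with respect to $\{v(t)\}$, assuming $\{u(t)\}$ to be a given sequence
of numbers.

Let the covariance function $R_v(\tau)$ and the spectrum $\Phi_v(\omega)$ of the process
$\{v(t)\}$ be defined by (2.14) and (2.63). Then evaluate

$$\begin{aligned}
EV_N(\omega)V_N(-\xi) &= \frac{1}{N}\sum_{r=1}^{N}\sum_{s=1}^{N} Ev(r)e^{-i\omega r}\,v(s)e^{+i\xi s} \\
&= \frac{1}{N}\sum_{r=1}^{N}\sum_{s=1}^{N} e^{i(\xi s - \omega r)}R_v(r-s) = [r - s = \tau] \\
&= \frac{1}{N}\sum_{r=1}^{N} e^{i(\xi-\omega)r}\sum_{\tau=r-N}^{r-1} R_v(\tau)e^{-i\xi\tau}
\end{aligned}$$

Now

$$\sum_{\tau=r-N}^{r-1} e^{-i\xi\tau}R_v(\tau) = \Phi_v(\xi) - \sum_{\tau=-\infty}^{r-N-1} e^{-i\xi\tau}R_v(\tau) - \sum_{\tau=r}^{\infty} e^{-i\xi\tau}R_v(\tau)$$

and

$$\frac{1}{N}\sum_{r=1}^{N} e^{i(\xi-\omega)r} = \begin{cases} 1, & \text{if } \xi = \omega \\ 0, & \text{if } (\xi - \omega) = \dfrac{2\pi k}{N}, \; k = \pm 1, \pm 2, \ldots, \pm(N-1) \end{cases}$$

Consider

$$\begin{aligned}
\left|\frac{1}{N}\sum_{r=1}^{N} e^{i(\xi-\omega)r}\sum_{\tau=-\infty}^{r-N-1} e^{-i\xi\tau}R_v(\tau)\right| &\le \frac{1}{N}\sum_{r=1}^{N}\sum_{\tau=-\infty}^{r-N-1}\left|R_v(\tau)\right| \\
&\le [\text{change order of summation}] \\
&\le \frac{1}{N}\sum_{\tau=-\infty}^{-1}|\tau|\cdot\left|R_v(\tau)\right| \le \frac{C}{N}
\end{aligned}$$

provided

$$\sum_{\tau=-\infty}^{\infty}\left|\tau\cdot R_v(\tau)\right| < \infty \tag{6.30}$$

Similarly,

$$\left|\frac{1}{N}\sum_{r=1}^{N} e^{i(\xi-\omega)r}\sum_{\tau=r}^{\infty} e^{-i\xi\tau}R_v(\tau)\right| \le \frac{1}{N}\sum_{\tau=1}^{\infty}\tau\left|R_v(\tau)\right| \le \frac{C}{N}$$

Combining these expressions, we find that

$$EV_N(\omega)V_N(-\xi) = \begin{cases} \Phi_v(\omega) + \rho_2(N), & \text{if } \xi = \omega \\ \rho_2(N), & \text{if } |\xi - \omega| = \dfrac{2\pi k}{N}, \; k = 1, 2, \ldots, N-1 \end{cases} \tag{6.31}$$

with $|\rho_2(N)| \le 2C/N$. These calculations can be summarized as the following
result.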


Lemma 6.1. Consider a strictly stable system

$$y(t) = G_0(q)u(t) + v(t) \tag{6.32}$$

with a disturbance $\{v(t)\}$ being a stationary stochastic process with spectrum $\Phi_v(\omega)$
and covariance function $R_v(\tau)$, subject to (6.30). Let $\{u(t)\}$ be independent of $\{v(t)\}$
and assume that $|u(t)| \le C$ for all $t$. Then with $\hat{\hat{G}}_N(e^{i\omega})$ defined by (6.24), we have

$$E\hat{\hat{G}}_N(e^{i\omega}) = G_0(e^{i\omega}) + \frac{\rho_1(N)}{U_N(\omega)} \tag{6.33a}$$

where

$$\left|\rho_1(N)\right| \le \frac{C_1}{\sqrt{N}} \tag{6.33b}$$

and

$$E\left[\hat{\hat{G}}_N(e^{i\omega}) - G_0(e^{i\omega})\right]\left[\hat{\hat{G}}_N(e^{-i\xi}) - G_0(e^{-i\xi})\right] = \begin{cases} \dfrac{1}{\left|U_N(\omega)\right|^2}\left[\Phi_v(\omega) + \rho_2(N)\right], & \text{if } \xi = \omega \\[2ex] \dfrac{\rho_2(N)}{U_N(\omega)U_N(-\xi)}, & \text{if } |\xi - \omega| = \dfrac{2\pi k}{N}, \; k = 1, 2, \ldots, N-1 \end{cases} \tag{6.34a}$$

where

$$\left|\rho_2(N)\right| \le \frac{C_2}{N} \tag{6.34b}$$

Here $U_N$ is defined by (2.37), and we restrict ourselves to frequencies for which $\hat{\hat{G}}_N$
is defined. According to Theorem 2.1 and (6.30), the constants can be taken as

$$C_1 = \left(2\sum_{k=1}^{\infty}\left|k g_0(k)\right|\right)\cdot\max\left|u(t)\right| \tag{6.35a}$$

$$C_2 = C_1^2 + \sum_{k=-\infty}^{\infty}\left|k R_v(k)\right| \tag{6.35b}$$

If $\{u(t)\}$ is periodic, then according to the Corollary of Theorem 2.1 $\rho_1(N) = 0$ at
$\omega = 2\pi k/N$, so we can take $C_1 = 0$.

Remark. Note that the input is regarded as a given sequence. Probabilistic
quantities, such as $E$, “bias,” and “variance,” refer to the probability space of $\{v(t)\}$.
This does not, of course, exclude that the input may be generated as a realization of
a stochastic process independent of $\{v(t)\}$.


The properties of the ETFE are closely related to those of periodogram esti- 
mates of spectra. See (2.43) and (2.74). We have the following result. 


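Lemma 6.2. Let $v(t)$ be given by

$$v(t) = H(q)e(t)$$

where $\{e(t)\}$ is a white-noise sequence with variance $\lambda$ and fourth moment $\mu_4$, and
$H$ is a strictly stable filter. Let $V_N(\omega)$ be defined by (6.27), and let $\Phi_v(\omega)$ be the
spectrum of $v(t)$. Then

$$E\left|V_N(\omega)\right|^2 = \Phi_v(\omega) + \rho_3(N) \tag{6.36}$$

$$E\left(\left|V_N(\omega)\right|^2 - \Phi_v(\omega)\right)\left(\left|V_N(\xi)\right|^2 - \Phi_v(\xi)\right) = \begin{cases} \left[\Phi_v(\omega)\right]^2 + \rho_4(N), & \text{if } \xi = \omega, \; \xi \ne 0 \\ \rho_4(N), & \text{if } |\xi - \omega| = \dfrac{2\pi k}{N}, \; k = 1, 2, \ldots, N-1 \end{cases} \tag{6.37}$$

where

$$\left|\rho_3(N)\right| \le \frac{C}{N}, \qquad \left|\rho_4(N)\right| \le \frac{C}{N}$$

Proof. Equation (6.36) is a restatement of (6.31). A simple proof of (6.37) is
outlined in Problem 6D.2 under somewhat more restrictive conditions. A full proof
can be given by direct evaluation of (6.37). See, for example, Brillinger (1981),
Theorem 5.2.4, for that. See Problem 6G.5 for ideas on how the bias term can be
improved by the use of data tapering. ∎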


These lemmas, together with the results of Section 2.3, tell us the following: 


Case 1. The input is periodic. When the input is periodic and $N$ is a multiple
of the period, we know from Example 2.2 that $|U_N(\omega)|^2$ increases like const $\cdot\, N$ for
some $\omega$ and is zero for others [see (2.49)]. The number of frequencies $\omega = 2\pi k/N$
for which $|U_N(\omega)|^2$ is nonzero, and hence for which the ETFE is defined, is fixed
and no more than the period length of the signal. We thus find that


• The ETFE $\hat{\hat G}_N(e^{i\omega})$ is defined only for a fixed number of frequencies.
• At these frequencies the ETFE is unbiased and its variance decays like $1/N$.


We note that the results (6.16) on frequency analysis by the correlation method 
are obtained as a special case. 
Case 2. The input is a realization of a stochastic process. Lemma 6.2 shows
that the periodogram $|U_N(\omega)|^2$ is an erratic function of $\omega$, which fluctuates around
$\Phi_u(\omega)$, which we assume to be bounded. Lemma 6.1 thus tells us that


• The ETFE is an asymptotically unbiased estimate of the transfer function at
increasingly (with $N$) many frequencies.


• The variance of the ETFE does not decrease as $N$ increases, and it is given as
the noise-to-signal ratio at the frequency in question.


• The estimates at different frequencies are asymptotically uncorrelated.


It follows from this discussion that, in the case of a periodic input signal, the
ETFE will be of increasingly good quality at the frequencies that are present in the
input. However, when the input is not periodic, the variance does not decay with $N$,
but remains equal to the noise-to-signal ratio at the corresponding frequency. This
latter property makes the empirical estimate a very crude estimate in most practical
cases.
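
The two cases can be tried out numerically. The following is a minimal sketch, not taken from the text: it assumes a hypothetical FIR system with impulse response `g0`, a random binary input treated as one period of a periodic signal, and the normalization of $U_N$ and $Y_N$ used in this chapter. All function names are illustrative.

```python
import cmath
import math
import random


def dft(x, omega):
    """X_N(w) = N^{-1/2} * sum_{t=1}^{N} x(t) e^{-i w t}; x[0] holds x(1)."""
    n = len(x)
    total = sum(x[t - 1] * cmath.exp(-1j * omega * t) for t in range(1, n + 1))
    return total / math.sqrt(n)


def etfe(y, u, omega):
    """Empirical transfer-function estimate Y_N(w) / U_N(w)."""
    return dft(y, omega) / dft(u, omega)


def simulate_periodic_fir(g, u):
    """Steady-state output of y(t) = sum_k g[k] u(t-k) for an N-periodic input."""
    n = len(u)
    return [sum(gk * u[(t - k) % n] for k, gk in enumerate(g)) for t in range(n)]


def true_response(g, omega):
    """G_0(e^{iw}) for the FIR coefficients g[0], g[1], ..."""
    return sum(gk * cmath.exp(-1j * omega * k) for k, gk in enumerate(g))


random.seed(0)
N = 64
g0 = [0.0, 1.0, 0.5]                                  # hypothetical impulse response
u = [random.choice((-1.0, 1.0)) for _ in range(N)]    # one period of the input
y_clean = simulate_periodic_fir(g0, u)
y_noisy = [yt + random.gauss(0.0, 0.5) for yt in y_clean]

for k in (3, 11, 21):
    w = 2 * math.pi * k / N
    # true amplitude, noise-free ETFE amplitude, noisy ETFE amplitude
    print(k, abs(true_response(g0, w)), abs(etfe(y_clean, u, w)), abs(etfe(y_noisy, u, w)))
```

Without noise and with a periodic input, the ETFE reproduces the true response at the grid frequencies $2\pi k/N$; with noise, each frequency carries its own independent error, which is the behavior described above.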


It is easy to understand the reason why the variance does not decrease with 
N. We determine as many independent estimates as we have data points. In other 
words, we have no feature of data and information compression. This in turn is due 
to the fact that we have only assumed linearity about the true system. Consequently, 
the system's properties at different frequencies may be totally unrelated. From this 
it also follows that the only possibility to increase the information per estimated 
parameter is to assume that the system's behavior at one frequency is related to that 
at another. In the subsequent section, we shall discuss one approach to how this can
be done. 


6.4 SPECTRAL ANALYSIS 


Spectral analysis for determining transfer functions of linear systems was developed 
from statistical methods for spectral estimation. Good accounts of this method are 
given in Chapter 10 in Jenkins and Watts (1968) and in Chapter 6 in Brillinger (1981),
and the method is widely discussed in many other textbooks on time series analysis.
In this section we shall adopt a slightly non-standard approach to the subject by 
deriving the standard techniques as a smoothed version of the ETFE. 


Smoothing the ETFE 


We mentioned at the end of the previous section that the only way to improve on 
the poor variance properties of the ETFE is to assume that the values of the true 
transfer function at different frequencies are related. We shall now introduce the 
rather reasonable prejudice that 


The true transfer function $G_0(e^{i\omega})$ is a smooth function of $\omega$. (6.38)

If the frequency distance $2\pi/N$ is small compared to how quickly $G_0(e^{i\omega})$ changes,
then

$$\hat{\hat G}_N(e^{2\pi i k/N}), \quad k \text{ integer}, \quad 2\pi k/N \approx \omega \tag{6.39}$$

are uncorrelated, unbiased estimates of roughly the same constant $G_0(e^{i\omega})$, each
with a variance of

$$\frac{\Phi_v(2\pi k/N)}{|U_N(2\pi k/N)|^2}$$

according to Lemma 6.1. Here we neglected terms that tend to zero as $N$ tends to
infinity.

If we assume $G_0(e^{i\omega})$ to be constant over the interval

$$\frac{2\pi k_1}{N} = \omega_0 - \Delta\omega < \omega < \omega_0 + \Delta\omega = \frac{2\pi k_2}{N} \tag{6.40}$$


then it is well known that the best (in a minimum variance sense) way to estimate
this constant is to form a weighted average of the “measurements” (6.39) for the
frequencies (6.40), each measurement weighted according to its inverse variance
[compare Problem 6E.3, and Lemma II.2, (II.65), in Appendix II]:

$$\hat G_N(e^{i\omega_0}) = \frac{\sum_{k=k_1}^{k_2} \alpha_k \hat{\hat G}_N(e^{2\pi i k/N})}{\sum_{k=k_1}^{k_2} \alpha_k} \tag{6.41a}$$

$$\alpha_k = \frac{|U_N(2\pi k/N)|^2}{\Phi_v(2\pi k/N)} \tag{6.41b}$$


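For large $N$ we could with good approximation work with the integrals that
correspond to the (Riemann) sums in (6.41):

$$\hat G_N(e^{i\omega_0}) = \frac{\int_{\omega_0-\Delta\omega}^{\omega_0+\Delta\omega} \alpha(\xi)\,\hat{\hat G}_N(e^{i\xi})\,d\xi}{\int_{\omega_0-\Delta\omega}^{\omega_0+\Delta\omega} \alpha(\xi)\,d\xi}, \qquad \alpha(\xi) = \frac{|U_N(\xi)|^2}{\Phi_v(\xi)} \tag{6.42}$$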


If the transfer function $G_0$ is not constant over the interval (6.40) it is reasonable
to use an additional weighting that pays more attention to frequencies close to $\omega_0$:

$$\hat G_N(e^{i\omega_0}) = \frac{\int_{-\pi}^{\pi} W_\gamma(\xi - \omega_0)\,\alpha(\xi)\,\hat{\hat G}_N(e^{i\xi})\,d\xi}{\int_{-\pi}^{\pi} W_\gamma(\xi - \omega_0)\,\alpha(\xi)\,d\xi} \tag{6.43}$$

Here $W_\gamma(\xi)$ is a function centered around $\xi = 0$ and $\gamma$ is a “shape parameter,”
which we shall discuss shortly.
Clearly, (6.42) corresponds to

$$W_\gamma(\xi) = \begin{cases} 1, & |\xi| < \Delta\omega \\ 0, & |\xi| > \Delta\omega \end{cases} \tag{6.44}$$


Now, if the noise spectrum $\Phi_v(\omega)$ is known, the estimate (6.43) can be realized as
written. If $\Phi_v(\omega)$ is not known we could argue as follows: Suppose that the noise
spectrum does not change very much over frequency intervals corresponding to the
“width” of the weighting function $W_\gamma(\xi)$:

$$\int_{-\pi}^{\pi} W_\gamma(\xi - \omega_0)\left|\frac{1}{\Phi_v(\xi)} - \frac{1}{\Phi_v(\omega_0)}\right| d\xi = \text{“small”} \tag{6.45}$$

Then $\alpha(\xi)$ in (6.42) can be replaced by $\alpha(\xi) = |U_N(\xi)|^2/\Phi_v(\omega_0)$, which means that
the constant $\Phi_v(\omega_0)$ cancels when (6.43) is formed. Under (6.45) the estimate

$$\hat G_N(e^{i\omega_0}) = \frac{\int_{-\pi}^{\pi} W_\gamma(\xi - \omega_0)\,|U_N(\xi)|^2\,\hat{\hat G}_N(e^{i\xi})\,d\xi}{\int_{-\pi}^{\pi} W_\gamma(\xi - \omega_0)\,|U_N(\xi)|^2\,d\xi} \tag{6.46}$$

is thus a good approximation of (6.42) and (6.43).
We may remark that, if (6.45) does not hold, it might be better to include a
procedure where $\Phi_v(\omega)$ is estimated and use that estimate in (6.43).
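
A Riemann-sum version of (6.46) on the grid $\omega_k = 2\pi k/N$ can be sketched as follows. This is only an illustration under stated assumptions: the rectangular weighting of (6.44) is used as the default, wrap-around of the window near $\pm\pi$ is ignored, and the function names and the static-gain test system are hypothetical.

```python
import cmath
import math
import random


def dft(x, omega):
    """X_N(w) = N^{-1/2} * sum_{t=1}^{N} x(t) e^{-i w t}; x[0] holds x(1)."""
    n = len(x)
    return sum(x[t] * cmath.exp(-1j * omega * (t + 1)) for t in range(n)) / math.sqrt(n)


def rect_window(xi, half_width):
    """Rectangular weighting as in (6.44); its scale cancels in the ratio."""
    return 1.0 if abs(xi) <= half_width else 0.0


def smoothed_estimate(y, u, omega0, half_width, window=rect_window):
    """Discrete counterpart of (6.46), summed over the frequencies 2*pi*k/N."""
    n = len(u)
    num = 0j
    den = 0.0
    for k in range(-(n // 2) + 1, n // 2 + 1):
        wk = 2 * math.pi * k / n
        weight = window(wk - omega0, half_width)
        if weight == 0.0:
            continue
        uk, yk = dft(u, wk), dft(y, wk)
        # |U|^2 * (Y / U) = Y * conj(U): no division by a possibly tiny U_N.
        num += weight * yk * uk.conjugate()
        den += weight * abs(uk) ** 2
    return num / den


random.seed(1)
N = 128
u = [random.choice((-1.0, 1.0)) for _ in range(N)]
y = [2.0 * ut + random.gauss(0.0, 1.0) for ut in u]   # hypothetical static gain 2 plus noise

# A wider window averages more frequencies and typically lands closer to 2.
print(abs(smoothed_estimate(y, u, 1.0, 0.05)), abs(smoothed_estimate(y, u, 1.0, 0.5)))
```

Writing the numerator as $Y_N \bar U_N$ rather than $|U_N|^2 \hat{\hat G}_N$ is a design choice that avoids dividing by small values of $U_N$.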


Connection with the Blackman-Tukey Procedure (*)

Consider the denominator of (6.46). It is a weighted average of the periodogram
$|U_N(\xi)|^2$. Using the result (2.74), we find that, as $N \to \infty$,

$$\int_{-\pi}^{\pi} W_\gamma(\xi - \omega_0)\,|U_N(\xi)|^2\,d\xi \to \int_{-\pi}^{\pi} W_\gamma(\xi - \omega_0)\,\Phi_u(\xi)\,d\xi \tag{6.47}$$

where $\Phi_u(\omega)$ is the spectrum of $\{u(t)\}$, as defined by (2.61) to (2.63). If, moreover,

$$\int_{-\pi}^{\pi} W_\gamma(\xi)\,d\xi = 1$$

and the weighting function $W_\gamma(\xi)$ is concentrated around $\xi = 0$ with a width over
which $\Phi_u(\omega)$ does not change much, then the right side of (6.47) is close to $\Phi_u(\omega_0)$.
We may thus interpret the left side as an estimate of this quantity:

$$\hat\Phi_u^N(\omega_0) = \int_{-\pi}^{\pi} W_\gamma(\xi - \omega_0)\,|U_N(\xi)|^2\,d\xi \tag{6.48}$$

Similarly, since

$$|U_N(\xi)|^2\,\hat{\hat G}_N(e^{i\xi}) = |U_N(\xi)|^2\,\frac{Y_N(\xi)}{U_N(\xi)} = Y_N(\xi)\,\overline{U}_N(\xi) \tag{6.49}$$
we have that the numerator of (6.46)

$$\hat\Phi_{yu}^N(\omega_0) = \int_{-\pi}^{\pi} W_\gamma(\xi - \omega_0)\,Y_N(\xi)\,\overline{U}_N(\xi)\,d\xi \tag{6.50}$$

is an estimate of the cross spectrum between output and input. The transfer function
estimate (6.46) is thus the ratio of two spectral estimates:

$$\hat G_N(e^{i\omega_0}) = \frac{\hat\Phi_{yu}^N(\omega_0)}{\hat\Phi_u^N(\omega_0)} \tag{6.51}$$

which makes sense, in view of (2.80). The spectral estimates (6.48) and (6.50) are
the standard estimates, suggested in the literature, for spectra and cross spectra as
smoothed periodograms. See Blackman and Tukey (1958), Jenkins and Watts (1968),
or Brillinger (1981).

An alternative way of expressing these estimates is common. The Fourier
coefficients for the periodogram $|U_N(\omega)|^2$ are

$$\hat R_u^N(\tau) = \frac{1}{2\pi}\int_{-\pi}^{\pi} |U_N(\omega)|^2 e^{i\tau\omega}\,d\omega = \frac{1}{N}\sum_{t=1}^{N} u(t)u(t - \tau) \tag{6.52}$$

[For this expression to hold exactly, the values $u(s)$ outside the interval $1 \le s \le N$
have to be interpreted by periodic continuation, i.e., $u(s) = u(s - N)$ if $s > N$; see
Problem 6D.1.]

Similarly, let the Fourier coefficients of the function $2\pi W_\gamma(\xi)$ be

$$w_\gamma(\tau) = \int_{-\pi}^{\pi} W_\gamma(\xi)\,e^{i\xi\tau}\,d\xi \tag{6.53}$$


Since the integral (6.48) is a convolution, its Fourier coefficients will be the product
of (6.52) and (6.53), so a Fourier expansion of (6.48) gives

$$\hat\Phi_u^N(\omega) = \sum_{\tau=-\infty}^{\infty} w_\gamma(\tau)\,\hat R_u^N(\tau)\,e^{-i\tau\omega} \tag{6.54}$$

The idea is now that the nice, smooth function $W_\gamma(\xi)$ is chosen so that its Fourier
coefficients vanish for $|\tau| > \delta_\gamma$, where typically $\delta_\gamma \ll N$. It is consequently sufficient
to form (6.52) (using the rightmost expression) for $|\tau| \le \delta_\gamma$, and then take

$$\hat\Phi_u^N(\omega) = \sum_{\tau=-\delta_\gamma}^{\delta_\gamma} w_\gamma(\tau)\,\hat R_u^N(\tau)\,e^{-i\tau\omega} \tag{6.55}$$

This is perhaps the most convenient way of forming the spectral estimate. The
expressions for $\hat\Phi_{yu}^N(\omega)$ are of course analogous.
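
The lag-window form (6.55) is straightforward to sketch in code. The version below is illustrative only: it assumes the periodic-continuation reading of (6.52), uses the Bartlett lag window $1 - |\tau|/\gamma$ with $\delta_\gamma = \gamma$ as a default, and all names are hypothetical.

```python
import cmath
import math
import random


def circular_covariance(u, tau):
    """(1/N) * sum_t u(t) u(t - tau), with u continued periodically outside 1..N."""
    n = len(u)
    return sum(u[t] * u[(t - tau) % n] for t in range(n)) / n


def bartlett_lag_window(tau, gamma):
    """Bartlett lag window 1 - |tau|/gamma, zero for |tau| >= gamma."""
    return max(0.0, 1.0 - abs(tau) / gamma)


def blackman_tukey(u, omega, gamma, lag_window=bartlett_lag_window):
    """Spectral estimate (6.55) with delta_gamma = gamma (gamma a positive integer)."""
    total = 0j
    for tau in range(-gamma, gamma + 1):
        total += (lag_window(tau, gamma)
                  * circular_covariance(u, tau)
                  * cmath.exp(-1j * tau * omega))
    # The covariance and the window are both even in tau, so the sum is real.
    return total.real


random.seed(2)
u = [random.choice((-1.0, 1.0)) for _ in range(200)]   # white binary input, spectrum 1
for omega in (0.5, 1.5, 2.5):
    print(omega, blackman_tukey(u, omega, gamma=10))
```

Only $2\gamma + 1$ covariance values are needed, which is why this form is usually cheaper than integrating the periodogram when $\gamma \ll N$.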
Weighting Function $W_\gamma(\xi)$: The Frequency Window


Let us now discuss the weighting function $W_\gamma(\xi)$. In spectral analysis, it is often
called the frequency window. [Similarly, $w_\gamma(\tau)$ is called the lag window.] If this
window is “wide,” then many different frequencies will be weighted together in (6.46).
This should lead to a small variance of $\hat G_N(e^{i\omega_0})$. At the same time, a wide window
will involve frequency estimates farther away from $\omega_0$, with expected values that
may differ considerably from $G_0(e^{i\omega_0})$. This will cause large bias. The width of
the window will thus control the trade-off between bias and variance. To make this
trade-off a bit more formal, we shall use the scalar $\gamma$ to describe the width, so a large
value of $\gamma$ corresponds to a narrow window.


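We shall characterize the window by the following numbers:

$$\int_{-\pi}^{\pi} W_\gamma(\xi)\,d\xi = 1, \qquad \int_{-\pi}^{\pi} \xi\,W_\gamma(\xi)\,d\xi = 0, \qquad \int_{-\pi}^{\pi} \xi^2\,W_\gamma(\xi)\,d\xi = M(\gamma) \tag{6.56a}$$

$$\int_{-\pi}^{\pi} |\xi|^3\,W_\gamma(\xi)\,d\xi = C_3(\gamma), \qquad 2\pi\int_{-\pi}^{\pi} W_\gamma^2(\xi)\,d\xi = \bar W(\gamma) \tag{6.56b}$$

As $\gamma$ increases (and the frequency window gets more narrow), the number $M(\gamma)$
decreases, while $\bar W(\gamma)$ increases.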


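Some typical windows are given in Table 6.1. [See also Table 3.3.1 in Brillinger
(1981) for a more complete collection of windows.] Notice that the scaling quantity


TABLE 6.1 Some Windows for Spectral Analysis

For each window: the frequency window $2\pi W_\gamma(\omega)$, then the lag window $w_\gamma(\tau)$, $0 \le |\tau| \le \gamma$.

Bartlett
    $2\pi W_\gamma(\omega) = \dfrac{1}{\gamma}\left(\dfrac{\sin \gamma\omega/2}{\sin \omega/2}\right)^2$
    $w_\gamma(\tau) = 1 - \dfrac{|\tau|}{\gamma}$

Parzen
    $2\pi W_\gamma(\omega) = \dfrac{4(2 + \cos\omega)}{\gamma^3}\left(\dfrac{\sin \gamma\omega/4}{\sin \omega/2}\right)^4$
    $w_\gamma(\tau) = 1 - \dfrac{6\tau^2}{\gamma^2}\left(1 - \dfrac{|\tau|}{\gamma}\right)$ for $0 \le |\tau| \le \dfrac{\gamma}{2}$
    $w_\gamma(\tau) = 2\left(1 - \dfrac{|\tau|}{\gamma}\right)^3$ for $\dfrac{\gamma}{2} \le |\tau| \le \gamma$

Hamming
    $2\pi W_\gamma(\omega) = \tfrac{1}{2}D_\gamma(\omega) + \tfrac{1}{4}D_\gamma(\omega - \pi/\gamma) + \tfrac{1}{4}D_\gamma(\omega + \pi/\gamma)$, where $D_\gamma(\omega) = \dfrac{\sin(\gamma + \tfrac{1}{2})\omega}{\sin \omega/2}$
    $w_\gamma(\tau) = \dfrac{1}{2}\left(1 + \cos\dfrac{\pi\tau}{\gamma}\right)$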
Figure 6.1 Some common frequency windows (horizontal axis in rad/s). Solid line: Parzen; dashed line:
Hamming; dotted line: Bartlett. $\gamma = 5$.


$\gamma$ has been chosen so that $\delta_\gamma = \gamma$ in (6.55). The frequency windows are shown
graphically in Figure 6.1. For these windows, we have


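$$\begin{aligned} \text{Bartlett:} \quad & M(\gamma) \approx \frac{2.78}{\gamma}, & \bar W(\gamma) &\approx 0.67\gamma \\ \text{Parzen:} \quad & M(\gamma) \approx \frac{12}{\gamma^2}, & \bar W(\gamma) &\approx 0.54\gamma \\ \text{Hamming:} \quad & M(\gamma) \approx \frac{\pi^2}{2\gamma^2}, & \bar W(\gamma) &\approx 0.75\gamma \end{aligned} \tag{6.57}$$

The expressions are asymptotic for large $\gamma$ but are good approximations for $\gamma \gtrsim 5$.
See also Problem 6T.1 for a further discussion of how to scale windows.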
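
The Bartlett entries in (6.57) can be checked by evaluating the integrals (6.56) numerically. The sketch below is an illustration rather than part of the text: it uses a plain midpoint rule with an arbitrarily chosen number of points, and the function names are hypothetical.

```python
import math


def bartlett_frequency_window(xi, gamma):
    """W_gamma(xi) for the Bartlett window of Table 6.1 (gamma a positive integer)."""
    s = math.sin(xi / 2)
    if abs(s) < 1e-12:
        return gamma / (2 * math.pi)          # limit value at xi = 0
    return (math.sin(gamma * xi / 2) / s) ** 2 / (2 * math.pi * gamma)


def window_constants(window, gamma, points=20000):
    """Midpoint-rule values of the area, M(gamma) and W-bar(gamma) in (6.56)."""
    h = 2 * math.pi / points
    area = m = wbar = 0.0
    for j in range(points):
        xi = -math.pi + (j + 0.5) * h
        w = window(xi, gamma)
        area += w * h
        m += xi * xi * w * h
        wbar += 2 * math.pi * w * w * h
    return area, m, wbar


for gamma in (5, 10, 20):
    area, m, wbar = window_constants(bartlett_frequency_window, gamma)
    # columns: gamma, area, M(gamma), 2.78/gamma, W-bar(gamma), 0.67*gamma
    print(gamma, round(area, 4), round(m, 4), round(2.78 / gamma, 4),
          round(wbar, 3), round(0.67 * gamma, 3))
```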
Asymptotic Properties of the Smoothed Estimate 


The estimate (6.46) has been studied in several treatments of spectral analysis.
Results that are asymptotic in both $N$ and $\gamma$ can be derived as follows (see Appendix
6A). Consider the estimate (6.46), and suppose that the true system obeys the
assumptions of Lemma 6.1. We then have


Bias

$$E\hat G_N(e^{i\omega}) - G_0(e^{i\omega}) = M(\gamma)\cdot\left[\frac{1}{2}G_0''(e^{i\omega}) + G_0'(e^{i\omega})\frac{\Phi_u'(\omega)}{\Phi_u(\omega)}\right] + O(C_3(\gamma)) + O(1/\sqrt N), \qquad \gamma \to \infty,\ N \to \infty \tag{6.58}$$
Here $O(x)$ is ordo $x$. See notational conventions at the beginning of the book.
Prime and double prime denote differentiation with respect to $\omega$, once and twice
respectively.


Variance

$$E|\hat G_N(e^{i\omega}) - E\hat G_N(e^{i\omega})|^2 = \frac{1}{N}\cdot\bar W(\gamma)\cdot\frac{\Phi_v(\omega)}{\Phi_u(\omega)} + o(\bar W(\gamma)/N), \qquad \gamma \to \infty,\ N \to \infty,\ \gamma/N \to 0 \tag{6.59}$$

We repeat that expectation here is with respect to the noise sequence $\{v(t)\}$ and that
the input is supposed to be a deterministic quasi-stationary signal.
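Let us use the asymptotic expressions to evaluate the mean-square error (MSE):

$$E|\hat G_N(e^{i\omega}) - G_0(e^{i\omega})|^2 \approx M^2(\gamma)|R(\omega)|^2 + \frac{1}{N}\bar W(\gamma)\frac{\Phi_v(\omega)}{\Phi_u(\omega)} \tag{6.60}$$

Here

$$R(\omega) = \frac{1}{2}G_0''(e^{i\omega}) + G_0'(e^{i\omega})\frac{\Phi_u'(\omega)}{\Phi_u(\omega)} \tag{6.61}$$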


Some additional results can also be shown (see Brillinger, 1981, Chapter 6, and 
Problems 6D.3 and 6D.4). 


• The estimates $\mathrm{Re}\,\hat G_N(e^{i\omega})$ and $\mathrm{Im}\,\hat G_N(e^{i\omega})$ are asymptotically uncorrelated
and each have a variance equal to half that in (6.59). (6.62)

• The estimates $\hat G_N(e^{i\omega})$ at different frequencies are asymptotically
uncorrelated. (6.63)

• The estimates $\mathrm{Re}\,\hat G_N(e^{i\omega_k})$, $\mathrm{Im}\,\hat G_N(e^{i\omega_k})$, $k = 1, 2, \ldots, M$ at an arbitrary
collection of frequencies are asymptotically jointly normal distributed with means
and covariances given by (6.58) to (6.63). (6.64)

• For a translation to properties of $|\hat G_N(e^{i\omega})|$, $\arg \hat G_N(e^{i\omega})$, see Problem 9G.1.


From (6.60) we see that a desired property of the window is that both $M$ and
$\bar W$ should be small. We may also calculate the value of the width parameter $\gamma$ that
minimizes the MSE. Suppose that both $\gamma$ and $N$ tend to infinity and $\gamma/N$ tends
to zero, so that the asymptotic expressions are applicable. Suppose also that (6.57)
holds with $M(\gamma) = M/\gamma^2$ and $\bar W(\gamma) = \gamma\cdot\bar W$. Then (6.60) gives

$$\gamma_{\mathrm{opt}} = \left(\frac{4M^2|R(\omega)|^2\Phi_u(\omega)}{\bar W\Phi_v(\omega)}\right)^{1/5}\cdot N^{1/5} \tag{6.65}$$

This value can of course not be realized by the user, since the constant contains several
unknown quantities. We note, however, that in any case it increases like $N^{1/5}$, and it
should, in principle, be allowed to be frequency dependent. The frequency window
consequently should get more narrow when more data are available, which is a very
natural result.
The optimal choice of $\gamma$ leads to a mean-square error that decays like

$$\mathrm{MSE} \sim C\cdot N^{-4/5} \tag{6.66}$$
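
The trade-off behind (6.65) and (6.66) can be reproduced in a few lines. The sketch below evaluates the asymptotic MSE (6.60) under the scaling $M(\gamma) = M/\gamma^2$, $\bar W(\gamma) = \gamma \bar W$; the Parzen constants come from (6.57), while the values chosen for $|R(\omega)|^2$ and $\Phi_v/\Phi_u$ are purely hypothetical.

```python
def mse(gamma, n, m_const, w_const, r_abs2, noise_to_signal):
    """Asymptotic MSE (6.60) with M(gamma) = M / gamma^2 and W-bar(gamma) = gamma * W-bar."""
    bias_squared = (m_const / gamma ** 2) ** 2 * r_abs2
    variance = gamma * w_const * noise_to_signal / n
    return bias_squared + variance


def gamma_opt(n, m_const, w_const, r_abs2, noise_to_signal):
    """Minimizer (6.65); noise_to_signal stands for Phi_v(w) / Phi_u(w)."""
    return (4 * m_const ** 2 * r_abs2 * n / (w_const * noise_to_signal)) ** 0.2


M_PARZEN, W_PARZEN = 12.0, 0.54     # from (6.57)
R_ABS2, RATIO = 1.0, 1.0            # hypothetical |R(w)|^2 and Phi_v / Phi_u

for n in (1000, 32000):
    g = gamma_opt(n, M_PARZEN, W_PARZEN, R_ABS2, RATIO)
    print(n, round(g, 2), mse(g, n, M_PARZEN, W_PARZEN, R_ABS2, RATIO))
```

Multiplying $N$ by 32 doubles $\gamma_{\mathrm{opt}}$ (since $32^{1/5} = 2$) and divides the minimal MSE by 16 (since $32^{4/5} = 16$).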


In practical use the trade-off (6.65) and (6.66) cannot be reached in formal terms.
Instead, a typical procedure would be to start by taking $\gamma = N/20$ (see Table 6.1)
and then compute and plot the corresponding estimates $\hat G_N(e^{i\omega})$ for various values
of $\gamma$. As $\gamma$ is increased, more and more details of the estimate will appear. These
will be due to decreased bias (true resonance peaks appearing more clearly and the
like), as well as to increased variance (spurious, random peaks). The procedure will
be stopped when the user feels that the emerging details are predominantly spurious.

Actually, as we noted, (6.65) points to the fact that the optimal window size 
should be frequency dependent. This can easily be implemented in (6.46), but not in 
(6.55), and most procedures do not utilize this feature. 


Example 6.1 A Simulated System 


The system

$$y(t) - 1.5y(t - 1) + 0.7y(t - 2) = u(t - 1) + 0.5u(t - 2) + e(t) \tag{6.67}$$

where $\{e(t)\}$ is white noise with variance 1 was simulated with the input as a PRBS
signal (see Section 13.3) over 1000 samples. Part of the resulting data record is
shown in Figure 6.2. The corresponding ETFE is shown in Figure 6.3a. An estimate
$\hat G_N(e^{i\omega})$ was formed using (6.46), with $W_\gamma(\xi)$ being a Parzen window with various
values of $\gamma$. Figure 6.3b, c, and d shows the results for $\gamma = 10$, 50, and 200. Here $\gamma = 50$
appears to be a reasonable choice of window size.
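
A data set of this kind can be generated with a short script. The sketch below is an assumption-laden stand-in, not the original experiment: a random binary sequence replaces the PRBS generator, the system starts from rest, and the seed and function names are arbitrary.

```python
import cmath
import random


def simulate(u, noise_std=1.0, rng=None):
    """Simulate y(t) - 1.5 y(t-1) + 0.7 y(t-2) = u(t-1) + 0.5 u(t-2) + e(t) from rest."""
    rng = rng or random.Random(0)
    y = []
    for t in range(len(u)):
        y1 = y[t - 1] if t >= 1 else 0.0
        y2 = y[t - 2] if t >= 2 else 0.0
        u1 = u[t - 1] if t >= 1 else 0.0
        u2 = u[t - 2] if t >= 2 else 0.0
        e = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
        y.append(1.5 * y1 - 0.7 * y2 + u1 + 0.5 * u2 + e)
    return y


def true_response(omega):
    """G_0(e^{iw}) for the system (6.67)."""
    z = cmath.exp(-1j * omega)              # q^{-1} evaluated on the unit circle
    return (z + 0.5 * z ** 2) / (1 - 1.5 * z + 0.7 * z ** 2)


rng = random.Random(42)
u = [rng.choice((-1.0, 1.0)) for _ in range(1000)]   # random binary stand-in for a PRBS
y = simulate(u, noise_std=1.0, rng=rng)

for omega in (0.1, 0.46, 1.0, 2.0):
    print(omega, abs(true_response(omega)))           # amplitude of the true system
```

The amplitudes printed at the end correspond to the thick "true system" curves against which the estimates are compared.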


Figure 6.2 The simulated data from (6.67). [Two panels, OUTPUT #1 and INPUT #1, shown for samples 100 to 200.]
Figure 6.3 Amplitude plots of the estimate $\hat G_N(e^{i\omega})$. a: ETFE. b: $\gamma = 10$. c: $\gamma = 50$. d:
$\gamma = 200$. Thick lines: true system; thin lines: estimate.


Another Way of Smoothing the ETFE (*) 


The guiding idea behind the estimate (6.46) is that the ETFEs at neighboring
frequencies are asymptotically uncorrelated, and that hence the variance could be reduced by
averaging over these. The ETFEs obtained over different data sets will also provide
uncorrelated estimates, and another approach would be to form averages over these.
Thus, split the data set $Z^N$ into $M$ batches, each containing $R$ data ($N = R\cdot M$).
Then form the ETFE corresponding to the $k$th batch:

$$\hat{\hat G}_R^{(k)}(e^{i\omega}), \quad k = 1, 2, \ldots, M \tag{6.68}$$

The estimate can then be formed as a direct average

$$\hat G_N(e^{i\omega}) = \frac{1}{M}\sum_{k=1}^{M} \hat{\hat G}_R^{(k)}(e^{i\omega}) \tag{6.69}$$

or one that is weighted according to the inverse variances:

$$\hat G_N(e^{i\omega}) = \frac{\sum_{k=1}^{M} \beta_R^{(k)}(\omega)\,\hat{\hat G}_R^{(k)}(e^{i\omega})}{\sum_{k=1}^{M} \beta_R^{(k)}(\omega)} \tag{6.70}$$

with

$$\beta_R^{(k)}(\omega) = |U_R^{(k)}(\omega)|^2 \tag{6.71}$$

being the periodogram of the $k$th subbatch. The inverse variance of $\hat{\hat G}_R^{(k)}(e^{i\omega})$ is
$\beta_R^{(k)}(\omega)/\Phi_v(\omega)$, but the factor $\Phi_v(\omega)$ cancels when (6.70) is formed.
An advantage with the estimate (6.70) is that the fast Fourier transform (FFT)
can be efficiently used when $Z^N$ can be decomposed so that $R$ is a power of 2.
Compare Problem 6G.4. The method is known as Welch’s method, Welch (1967).
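
The batch averages (6.69) and (6.70) can be sketched as below. This is an illustration under simplifying assumptions: non-overlapping batches, a direct $O(R)$ transform per frequency instead of an FFT, and a hypothetical static-gain system; the function names are not from the text.

```python
import cmath
import math
import random


def dft(x, omega):
    """X_R(w) = R^{-1/2} * sum_{t=1}^{R} x(t) e^{-i w t}; x[0] holds x(1)."""
    n = len(x)
    return sum(x[t] * cmath.exp(-1j * omega * (t + 1)) for t in range(n)) / math.sqrt(n)


def batches(x, r):
    """Split x into consecutive, non-overlapping batches of length r."""
    return [x[i:i + r] for i in range(0, len(x) - r + 1, r)]


def averaged_etfe(y, u, omega, r, weighted=True):
    """Direct average (6.69) or periodogram-weighted average (6.70) of batch ETFEs."""
    num = 0j
    den = 0.0
    for yk, uk in zip(batches(y, r), batches(u, r)):
        u_r, y_r = dft(uk, omega), dft(yk, omega)
        beta = abs(u_r) ** 2 if weighted else 1.0     # (6.71) or unit weights
        num += beta * (y_r / u_r)
        den += beta
    return num / den


random.seed(3)
R, M = 32, 8
u = [random.choice((-1.0, 1.0)) for _ in range(R * M)]
y = [3.0 * ut + random.gauss(0.0, 1.0) for ut in u]   # hypothetical static gain 3 plus noise
w = 2 * math.pi * 5 / R
print(abs(averaged_etfe(y, u, w, R, weighted=False)), abs(averaged_etfe(y, u, w, R)))
```

With the weights (6.71), the numerator reduces to $\sum_k Y_R^{(k)} \overline{U}_R^{(k)}$, so the weighted form is a ratio of averaged cross- and auto-periodograms.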


6.5 ESTIMATING THE DISTURBANCE SPECTRUM (*)

Estimating Spectra

So far we have described how to estimate $G_0$ in a relationship (6.1):

$$y(t) = G_0(q)u(t) + v(t) \tag{6.72}$$

We shall now turn to the problem of estimating the spectrum of $\{v(t)\}$, $\Phi_v(\omega)$. Had
the disturbances $v(t)$ been available for direct measurement, we could have used
(6.48):

$$\hat\Phi_v^N(\omega) = \int_{-\pi}^{\pi} W_\gamma(\xi - \omega)\,|V_N(\xi)|^2\,d\xi \tag{6.73}$$

Here $W_\gamma(\cdot)$ is a frequency window of the kind described earlier.
It is entirely analogous to the analysis of the previous section to calculate the
properties of (6.73). We have:


Bias:

$$E\hat\Phi_v^N(\omega) - \Phi_v(\omega) = \frac{1}{2}M(\gamma)\cdot\Phi_v''(\omega) + O(C_3(\gamma)) + O(1/\sqrt N), \qquad \gamma \to \infty,\ N \to \infty \tag{6.74}$$
Variance:

$$\mathrm{Var}\,\hat\Phi_v^N(\omega) = \frac{\bar W(\gamma)}{N}\cdot\Phi_v^2(\omega) + O(1/N), \qquad \omega \ne 0, \pm\pi, \quad N \to \infty \tag{6.75}$$

Moreover, estimates at different frequencies are asymptotically uncorrelated.

The Residual Spectrum


Now the $v(t)$ in (6.72) are not directly measurable. However, given an estimate $\hat G_N$
of the transfer function, we may replace $v$ in the preceding expression by

$$\hat v(t) = y(t) - \hat G_N(q)u(t) \tag{6.76}$$

which gives the estimate

$$\hat\Phi_v^N(\omega) = \int_{-\pi}^{\pi} W_\gamma(\xi - \omega)\,|Y_N(\xi) - \hat G_N(e^{i\xi})U_N(\xi)|^2\,d\xi \tag{6.77}$$


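If $\hat G_N(e^{i\omega})$ is formed using (6.46) with the same window $W_\gamma(\cdot)$, this expression can
be rearranged as follows [using (6.48) to (6.51)]:

$$\begin{aligned} \hat\Phi_v^N(\omega) &= \int_{-\pi}^{\pi} W_\gamma(\xi - \omega)|Y_N(\xi)|^2\,d\xi + \int_{-\pi}^{\pi} W_\gamma(\xi - \omega)|U_N(\xi)|^2|\hat G_N(e^{i\xi})|^2\,d\xi \\ &\quad - 2\,\mathrm{Re}\int_{-\pi}^{\pi} W_\gamma(\xi - \omega)\hat G_N(e^{i\xi})U_N(\xi)\overline{Y}_N(\xi)\,d\xi \\ &\approx \int_{-\pi}^{\pi} W_\gamma(\xi - \omega)|Y_N(\xi)|^2\,d\xi + |\hat G_N(e^{i\omega})|^2\int_{-\pi}^{\pi} W_\gamma(\xi - \omega)|U_N(\xi)|^2\,d\xi \\ &\quad - 2\,\mathrm{Re}\,\hat G_N(e^{i\omega})\int_{-\pi}^{\pi} W_\gamma(\xi - \omega)U_N(\xi)\overline{Y}_N(\xi)\,d\xi \\ &= \hat\Phi_y^N(\omega) + \frac{|\hat\Phi_{yu}^N(\omega)|^2}{(\hat\Phi_u^N(\omega))^2}\cdot\hat\Phi_u^N(\omega) - 2\,\mathrm{Re}\,\frac{\hat\Phi_{yu}^N(\omega)}{\hat\Phi_u^N(\omega)}\cdot\overline{\hat\Phi_{yu}^N(\omega)} \end{aligned}$$

Here the approximate equality follows from replacing the smooth function $\hat G_N(e^{i\xi})$
over the small interval around $\xi = \omega$ with its value at $\omega$. Hence we have

$$\hat\Phi_v^N(\omega) = \hat\Phi_y^N(\omega) - \frac{|\hat\Phi_{yu}^N(\omega)|^2}{\hat\Phi_u^N(\omega)} \tag{6.78}$$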




Asymptotically, as $N \to \infty$ and $\gamma \to \infty$, so that $\hat G_N(e^{i\omega}) \to G_0(e^{i\omega})$
according to (6.60), we find that the estimate (6.77) tends to (6.73). The asymptotic
properties (6.74) and (6.75) will also hold for (6.77) and (6.78). In addition to the
properties already listed, we may note that the estimates $\hat\Phi_v^N(\omega)$ are asymptotically
uncorrelated with $\hat G_N(e^{i\omega})$. Moreover, $\hat\Phi_v^N(\omega_k)$, $\hat G_N(e^{i\omega_k})$, $k = 1, 2, \ldots, r$, are
asymptotically jointly normal random variables with mean and covariances given by
(6.58) to (6.64) and (6.74) to (6.75). A detailed account of the asymptotic theory is
given in Chapter 6 of Brillinger (1981).


Coherency Spectrum 


Denote

$$\hat\kappa_{yu}^N(\omega) = \sqrt{\frac{|\hat\Phi_{yu}^N(\omega)|^2}{\hat\Phi_y^N(\omega)\,\hat\Phi_u^N(\omega)}} \tag{6.79}$$

Then

$$\hat\Phi_v^N(\omega) = \hat\Phi_y^N(\omega)\left[1 - (\hat\kappa_{yu}^N(\omega))^2\right] \tag{6.80}$$

The function $\kappa_{yu}(\omega)$ is called the coherency spectrum (between $y$ and $u$) and can be
viewed as the (frequency dependent) correlation coefficient between the input and
output sequences. If this coefficient is 1 at a certain frequency, then there is perfect
correlation between input and output at that frequency. There is consequently no
noise interfering at that frequency, which is confirmed by (6.80).
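
The relations (6.78) to (6.80) are simple enough to check with numbers. In the sketch below the spectral values at a single frequency are hypothetical and are constructed from an assumed gain, input spectrum, and noise spectrum, so that the expected answers are known in advance.

```python
import math


def coherency(phi_y, phi_u, phi_yu):
    """Coherency estimate (6.79) from spectral estimates at one frequency."""
    return math.sqrt(abs(phi_yu) ** 2 / (phi_y * phi_u))


def residual_spectrum(phi_y, phi_u, phi_yu):
    """Noise-spectrum estimate (6.78)."""
    return phi_y - abs(phi_yu) ** 2 / phi_u


# Hypothetical values at one frequency: gain G, input spectrum and noise spectrum.
G, PHI_U, PHI_V = 2.0 - 1.0j, 1.5, 0.4
phi_yu = G * PHI_U                       # cross spectrum between output and input
phi_y = abs(G) ** 2 * PHI_U + PHI_V      # output spectrum

kappa = coherency(phi_y, PHI_U, phi_yu)
print(kappa, residual_spectrum(phi_y, PHI_U, phi_yu), phi_y * (1 - kappa ** 2))
```

Both routes return the assumed noise spectrum 0.4, and setting the noise spectrum to zero gives a coherency of 1.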


6.6 SUMMARY 


In this chapter we have shown how simple techniques of transient and frequency
response can give valuable insight into the properties of linear systems. We have
introduced the empirical transfer-function estimate (ETFE)

$$\hat{\hat G}_N(e^{i\omega}) = \frac{Y_N(\omega)}{U_N(\omega)} \tag{6.81}$$

based on data over the interval $1 \le t \le N$. Here

$$Y_N(\omega) = \frac{1}{\sqrt N}\sum_{t=1}^{N} y(t)e^{-i\omega t}, \qquad U_N(\omega) = \frac{1}{\sqrt N}\sum_{t=1}^{N} u(t)e^{-i\omega t}$$

The ETFE has the property (see Lemma 6.1) that it is asymptotically unbiased, but
has a variance of $\Phi_v(\omega)/|U_N(\omega)|^2$.
We showed how smoothing the ETFE leads to the spectral analysis estimate

$$\hat G_N(e^{i\omega}) = \frac{\int_{-\pi}^{\pi} W_\gamma(\xi - \omega)\,|U_N(\xi)|^2\,\hat{\hat G}_N(e^{i\xi})\,d\xi}{\int_{-\pi}^{\pi} W_\gamma(\xi - \omega)\,|U_N(\xi)|^2\,d\xi} \tag{6.82}$$
A corresponding estimate of the noise spectrum is

$$\hat\Phi_v^N(\omega) = \int_{-\pi}^{\pi} W_\gamma(\xi - \omega)\,|Y_N(\xi) - \hat G_N(e^{i\xi})U_N(\xi)|^2\,d\xi \tag{6.83}$$

The properties of these estimates were summarized in (6.58) to (6.64) and (6.74) to
(6.75).

These properties depend on the parameter $\gamma$, which describes the width of the
associated frequency window $W_\gamma$. A narrow window (large $\gamma$) gives small bias
but high variance for the estimate, while the converse is true for wide windows.


6.7 BIBLIOGRAPHY 


Section 6.1: Wellstead (1981) gives a general survey of nonparametric methods for
system identification. A survey of transient response methods is given in Rake (1980).
Several ways of determining numerical characteristics from step responses are
discussed in Schwarze (1964). Correlation techniques are surveyed in Godfrey (1980).


Section 6.2: Frequency analysis is a classical identification method that is described
in many textbooks on control. For detailed treatments, see Rake (1980), which also
contains several interesting examples.


Section 6.3: General Fourier techniques are also discussed in Rake (1980). The
term “empirical transfer function estimate” for $\hat{\hat G}$ is introduced in this chapter, but
the estimate as such is well known.


Sections 6.4 and 6.5: Spectral analysis is a standard subject in textbooks on time
series. See, for example, Grenander and Rosenblatt (1957) (Chapters 4 to 6),
Anderson (1971) (Chapter 9), and Hannan (1970) (Chapter V). Among books devoted
entirely to spectral analysis, we could point to Kay (1988), Marple (1987), and
Stoica and Moses (1997). These texts deal primarily with estimation of power (auto-)
spectra. Among specific treatments of frequency-domain techniques, including
estimation of transfer functions, we note Brillinger (1981) for a thorough analytic study,
Jenkins and Watts (1968) for a more leisurely discussion of both statistical properties
and application aspects, and Bendat and Piersol (1980) for an application-oriented
approach. Another extensive treatment is Priestley (1981). Overviews of different
frequency-domain techniques are given in Brillinger and Krishnaiah (1983), and
a control-oriented survey is given by Godfrey (1980). The treatment given here is
based on Ljung (1985a). The first reference to the idea of smoothing the periodogram
to obtain a better spectral estimate appears to be Daniell (1946). A comparative
discussion of windows for spectral analysis is given in Geçkinli and Yavuz (1978) and
Papoulis (1973).

In addition to direct frequency-domain methods for estimating spectra, many
efficient methods are based on parametric fit, such as those to be discussed in the
following chapter. So-called maximum entropy methods (MEM) have found wide
use in signal-processing applications. See Burg (1967) for the first idea and Marple
(1987) for a comparative survey of different approaches.
6.8 PROBLEMS


6G.1 Consider the system

$$y(t) = G_0(q)u(t) + v(t)$$

controlled by the regulator

$$u(t) = -F(q)y(t) + r(t)$$

where $r(t)$ is an external reference signal. $r$ and $v$ are independent and their spectra
are $\Phi_r(\omega)$ and $\Phi_v(\omega)$, respectively. The usual spectral analysis estimate of $G_0$ is given
in (6.51) as well as (6.46). Show that as $N$ and $\gamma$ tend to infinity then $\hat G_N(e^{i\omega})$ will
converge to

$$\frac{G_0(e^{i\omega})\Phi_r(\omega) - F(e^{-i\omega})\Phi_v(\omega)}{\Phi_r(\omega) + |F(e^{i\omega})|^2\Phi_v(\omega)}$$

What happens in the two special cases $\Phi_r \equiv 0$ and $F \equiv 0$, respectively? Hint: Compare
Problem 2E.5.


6G.2 Prefiltering. Prefilter inputs and outputs:

$$u_F(t) = L_u(q)u(t), \qquad y_F(t) = L_y(q)y(t)$$

If (6.32) holds, then the filtered variables obey

$$y_F(t) = G_0^F(q)u_F(t) + v_F(t)$$

$$G_0^F(q) = \frac{L_y(q)}{L_u(q)}G_0(q), \qquad v_F(t) = L_y(q)v(t)$$

Apply spectral analysis to $u_F$, $y_F$, thus forming an estimate $\hat G_N^F(e^{i\omega})$. The estimate of
the original transfer function then is

$$\hat G_N(e^{i\omega}) = \frac{L_u(e^{i\omega})}{L_y(e^{i\omega})}\hat G_N^F(e^{i\omega})$$

Determine the asymptotic properties of $\hat G_N(e^{i\omega})$ and discuss how $L_u$ and $L_y$ can be
chosen for smallest MSE (cf. Ljung, 1985a).


6G.3 In Figure 6.3 the amplitude of the ETFE appears to be systematically larger than the
true amplitude, despite the fact that the ETFE is unbiased according to Lemma 6.1.
However, $\hat G$ being an unbiased estimate of $G_0$ does not imply that $|\hat G|$ is an unbiased
estimate of $|G_0|$. In fact, prove that

$$E|\hat{\hat G}_N(e^{i\omega})|^2 = |G_0(e^{i\omega})|^2 + \frac{\Phi_v(\omega)}{|U_N(\omega)|^2}$$

asymptotically for large $N$, under the assumptions of Lemma 6.1.
6G.4 The Cooley-Tukey spectral estimate for a process $\{v(t)\}$ is defined as

$$\hat\Phi_v^N(\omega) = \frac{1}{M}\sum_{k=1}^{M} |V_R^{(k)}(\omega)|^2$$

where $|V_R^{(k)}(\omega)|^2$ is the periodogram estimate of the $k$th subbatch of data:

$$V_R^{(k)}(\omega) = \frac{1}{\sqrt R}\sum_{t=1}^{R} v((k - 1)R + t)\,e^{-i\omega t}$$

See Cooley and Tukey (1965) or Hannan (1970), Chapter V. The cross-spectral estimate
is defined analogously. This estimate has the advantage that the FFT (fast Fourier
transform) can be applied (most efficiently if $R$ is a power of 2). Show that the estimate
(6.70) is the ratio of two appropriate Cooley-Tukey spectral estimates.


6G.5 “Tapers” or “faders.” The bias term $\rho_3(N)$ in (6.36) can be reduced if tapering is
introduced: Let $V_N^{(T)}(\omega)$ be defined by

$$V_N^{(T)}(\omega) = \sum_{t=1}^{N} h_t\,v(t)\,e^{-i\omega t}$$

where $\{h_t\}_1^N$ is a sequence of numbers (a tapering function) such that

$$\sum_{t=1}^{N} h_t^2 = 1$$

Let

$$H_N(\omega) = \sum_{t=1}^{N} h_t\,e^{-i\omega t}$$

Show that, under the conditions of Lemma 6.2,

$$E|V_N^{(T)}(\omega)|^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi} |H_N(\omega - \xi)|^2\,\Phi_v(\xi)\,d\xi$$

Show that our standard periodogram estimate, which uses $h_t = 1/\sqrt N$, gives

$$|H_N(\omega)|^2 = \frac{1}{N}\left[\frac{\sin N\omega/2}{\sin \omega/2}\right]^2$$

Other tapering coefficients (or “faders” or “convergence factors”) may give functions
$|H_N(\omega)|^2$ that are more “$\delta$-function like” than in the preceding equation (see, e.g., Table
6.1). The tapered periodogram can of course also be used to obtain smoothed spectra.
They will typically lead to decreased bias and (slightly) increased variance (Brillinger,
1981, Theorem 5.2.3 and Section 5.8).


6E.1 Determine an estimate for $G_0(e^{i\omega})$ based on the impulse-response estimates (6.4).
Show that this estimate coincides with the ETFE (6.24).


6E.2 Consider the system

$$y(t) = G_0(q)u(t) + v(t)$$

This system is controlled by output proportional feedback

$$u(t) = -Ky(t)$$

Let the ETFE $\hat{\hat G}_N(e^{i\omega})$ be computed in the straightforward way (6.24). What will this
estimate be? Compare with Lemma 6.1.


6E.3 Let $w_k$, $k = 1, \ldots, M$, be independent random variables, all with mean values 1 and
variances $E(w_k - 1)^2 = \lambda_k$. Consider

$$w = \sum_{k=1}^{M} \alpha_k w_k$$

Determine $\alpha_k$, $k = 1, \ldots, M$, so that

(a) $Ew = 1$.

(b) $E(w - 1)^2$ is minimized.


6T.1 A general approach to treat the relationships between the scaling parameter $\gamma$ and
the lag and frequency windows $w_\gamma(\tau)$ and $W_\gamma(\omega)$ [see (6.53)] can be given as follows.
Choose an even function $w(x)$ such that $w(0) = 1$ and $w(x) = 0$, $|x| > 1$, with Fourier
transform

$$W(\lambda) = \frac{1}{2\pi}\int_{-\infty}^{\infty} w(x)\,e^{-ix\lambda}\,dx$$

Let

$$M = \int_{-\infty}^{\infty} \lambda^2\,W(\lambda)\,d\lambda, \qquad \bar W = 2\pi\int_{-\infty}^{\infty} W^2(\lambda)\,d\lambda$$

Then define the lag window

$$w_\gamma(\tau) = w(\tau/\gamma)$$

This gives a frequency window

$$W_\gamma(\omega) = \frac{1}{2\pi}\sum_{\tau=-\gamma}^{\gamma} w_\gamma(\tau)\,e^{-i\tau\omega}$$

Show that, for large $\gamma$,

(a) $W_\gamma(\omega) \approx \gamma\cdot W(\gamma\cdot\omega)$
(b) $M(\gamma) \approx M/\gamma^2$
(c) $\bar W(\gamma) \approx \bar W\cdot\gamma$

where $M(\gamma)$ and $\bar W(\gamma)$ are defined by (6.56). Moreover, compute and compare $W_\gamma(\omega)$
and $\gamma\cdot W(\gamma\cdot\omega)$ for $w(x) = 1 - |x|$, $|x| \le 1$ (the Bartlett window). [Compare (6.57).
See also Hannan (1970), Section V.4.]
6T.2 Let $\{u(t)\}$ be a stationary stochastic process with zero mean value and covariance
function $R(\tau)$, such that

$$\sum_{\tau=-\infty}^{\infty} |\tau R(\tau)| < \infty$$


N 
1 
Sy = pÈ i: lal < Cy 


Show that 


for some constant Cz. 
6D.1 Prove (6.52) with the proper interpretation of values outside the interval $1 \le t \le N$.


6D.2 Prove a relaxed version of Lemma 6.2 with $|\rho_4(N)| \le C/\sqrt N$ by a direct application
of Theorem 2.1 and the properties of periodograms of white noise.


6D.3 Prove (6.63) by using expressions analogous to (6A.3) and (6A.4).

6D.4 Prove (6.62) by using (6.63) and

$$\mathrm{Re}\,G(e^{i\omega}) = \frac{G(e^{i\omega}) + G(e^{-i\omega})}{2}, \qquad \mathrm{Im}\,G(e^{i\omega}) = \frac{G(e^{i\omega}) - G(e^{-i\omega})}{2i}$$


APPENDIX 6A: DERIVATION OF THE ASYMPTOTIC PROPERTIES 
OF THE SPECTRAL ANALYSIS ESTIMATE 


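Consider the transfer function estimate (6.46). In this appendix we shall derive the
asymptotic properties (6.58) and (6.59). In order not to get too technical, some
elements of the derivation will be kept heuristic. Recall that $\{u(t)\}$ here is regarded
as a deterministic quasi-stationary sequence, and, hence, such that (6.47) holds.
We then have

$$\begin{aligned} E\hat G_N(e^{i\omega_0}) &= \frac{\int_{-\pi}^{\pi} W_\gamma(\xi - \omega_0)\,|U_N(\xi)|^2\left[G_0(e^{i\xi}) + \rho_1(N)/U_N(\xi)\right]d\xi}{\int_{-\pi}^{\pi} W_\gamma(\xi - \omega_0)\,|U_N(\xi)|^2\,d\xi} \\ &\approx \frac{\int_{-\pi}^{\pi} W_\gamma(\xi - \omega_0)\,\Phi_u(\xi)\,G_0(e^{i\xi})\,d\xi}{\int_{-\pi}^{\pi} W_\gamma(\xi - \omega_0)\,\Phi_u(\xi)\,d\xi} \end{aligned} \tag{6A.1}$$

using first Lemma 6.1 and then (6.47), neglecting the decaying term $\rho_1(N)$.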
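Now, expanding in Taylor series (prime denoting differentiation with respect
to $\omega$),

$$G_0(e^{i\xi}) \approx G_0(e^{i\omega_0}) + (\xi - \omega_0)G_0'(e^{i\omega_0}) + \frac{1}{2}(\xi - \omega_0)^2 G_0''(e^{i\omega_0})$$

$$\Phi_u(\xi) \approx \Phi_u(\omega_0) + (\xi - \omega_0)\Phi_u'(\omega_0) + \frac{1}{2}(\xi - \omega_0)^2\Phi_u''(\omega_0)$$

and noting that, according to (6.56),

$$\int_{-\pi}^{\pi} (\xi - \omega_0)\,W_\gamma(\xi - \omega_0)\,d\xi = 0$$

$$\int_{-\pi}^{\pi} (\xi - \omega_0)^2\,W_\gamma(\xi - \omega_0)\,d\xi = M(\gamma)$$

we find that the numerator of (6A.1) is approximately

$$G_0(e^{i\omega_0})\Phi_u(\omega_0) + M(\gamma)\left[\frac{1}{2}G_0\Phi_u'' + \frac{1}{2}G_0''\Phi_u + G_0'\Phi_u'\right]$$

(with all functions evaluated at $\omega_0$) and the denominator

$$\Phi_u(\omega_0) + \frac{1}{2}M(\gamma)\Phi_u''(\omega_0)$$

where we neglect effects that are of order $C_3(\gamma)$ [an order of magnitude smaller than
$M(\gamma)$ as $\gamma \to \infty$; see (6.56)]. Equation (6A.1) thus gives

$$E\hat G_N(e^{i\omega_0}) \approx G_0(e^{i\omega_0}) + M(\gamma)\left[\frac{1}{2}G_0''(e^{i\omega_0}) + G_0'(e^{i\omega_0})\frac{\Phi_u'(\omega_0)}{\Phi_u(\omega_0)}\right]$$

which is (6.58).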
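For the variance expression, we first have from (6.28) and (6.46) that

$$\hat G_N(e^{i\omega_0}) - E\hat G_N(e^{i\omega_0}) \approx \frac{\int_{-\pi}^{\pi} W_\gamma(\xi - \omega_0)\,\overline{U}_N(\xi)\,V_N(\xi)\,d\xi}{\int_{-\pi}^{\pi} W_\gamma(\xi - \omega_0)\,|U_N(\xi)|^2\,d\xi} \tag{6A.2}$$

Let us study the numerator of this expression. We write this, approximately, as a
Riemann sum [see (6.41); we could have kept it discrete all along]:

$$\int_{-\pi}^{\pi} W_\gamma(\xi - \omega_0)\,\overline{U}_N(\xi)\,V_N(\xi)\,d\xi \approx A_N = \frac{2\pi}{N}\sum_{k=-(N/2)+1}^{N/2} W_\gamma\!\left(\frac{2\pi k}{N} - \omega_0\right)\overline{U}_N\!\left(\frac{2\pi k}{N}\right)V_N\!\left(\frac{2\pi k}{N}\right) \tag{6A.3}$$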
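We have, with summation from $1 - N/2$ to $N/2$,

$$\begin{aligned} EA_N\overline{A}_N &= \frac{4\pi^2}{N^2}\sum_k\sum_\ell W_\gamma\!\left(\frac{2\pi k}{N} - \omega_0\right)W_\gamma\!\left(\frac{2\pi\ell}{N} - \omega_0\right)\overline{U}_N\!\left(\frac{2\pi k}{N}\right)U_N\!\left(\frac{2\pi\ell}{N}\right)EV_N\!\left(\frac{2\pi k}{N}\right)\overline{V}_N\!\left(\frac{2\pi\ell}{N}\right) \\ &\approx \frac{4\pi^2}{N^2}\sum_k\left[W_\gamma\!\left(\frac{2\pi k}{N} - \omega_0\right)\right]^2\left|U_N\!\left(\frac{2\pi k}{N}\right)\right|^2\Phi_v\!\left(\frac{2\pi k}{N}\right) \end{aligned} \tag{6A.4}$$

using (6.31) and neglecting the term $\rho_2(N)$.
Returning to the integral form, we thus have, using (6.47),

$$EA_N\overline{A}_N \approx \frac{2\pi}{N}\int_{-\pi}^{\pi} W_\gamma^2(\xi - \omega_0)\,\Phi_u(\xi)\,\Phi_v(\xi)\,d\xi \approx \frac{1}{N}\bar W(\gamma)\,\Phi_u(\omega_0)\,\Phi_v(\omega_0)$$

using (6.56) and the fact that, for large $\gamma$, $W_\gamma(\xi)$ is concentrated around $\xi = 0$.
The denominator of (6A.2) approximately equals $\Phi_u(\omega_0)$ for the same reason.


We thus find that

$$\mathrm{Var}[\hat G_N(e^{i\omega_0})] \approx \frac{(1/N)\bar W(\gamma)\,\Phi_u(\omega_0)\,\Phi_v(\omega_0)}{[\Phi_u(\omega_0)]^2} = \frac{1}{N}\bar W(\gamma)\frac{\Phi_v(\omega_0)}{\Phi_u(\omega_0)}$$

and (6.59) has been established.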
7 PARAMETER ESTIMATION METHODS


Suppose a set of candidate models has been selected, and it is parametrized as a model
structure (see Sections 4.5 and 5.7), using a parameter vector $\theta$. The search for the
best model within the set then becomes a problem of determining or estimating $\theta$.
There are many different ways of organizing such a search and also different views
on what one should search for. In the present chapter we shall concentrate on the
latter aspect: what should be meant by a “good model”? Computational issues (i.e.,
how to organize the actual search) will be dealt with in Chapters 10 and 11. The
evaluation of the properties of the models that result under various conditions and
using different methods is carried out in Chapters 8 and 9. In Chapter 15 we return
to the estimation methods, and give a more user-oriented summary of recommended
procedures.


7.1 GUIDING PRINCIPLES BEHIND PARAMETER ESTIMATION METHODS


Parameter Estimation Methods

We are now in the situation that we have selected a certain model structure $\mathcal M$, with
particular models $\mathcal M(\theta)$ parametrized using the parameter vector $\theta \in D_{\mathcal M} \subset \mathbf R^d$.
The set of models thus defined is

$$\mathcal M^* = \{\mathcal M(\theta) \mid \theta \in D_{\mathcal M}\} \tag{7.1}$$

Recall that each model represents a way of predicting future outputs. The predictor
could be a linear filter, as discussed in Chapter 4:

$$\mathcal M(\theta): \quad \hat y(t|\theta) = W_y(q, \theta)y(t) + W_u(q, \theta)u(t) \tag{7.2}$$

This could correspond to one-step-ahead prediction for an underlying system
description

$$y(t) = G(q, \theta)u(t) + H(q, \theta)e(t) \tag{7.3}$$

in which case

$$W_y(q, \theta) = [1 - H^{-1}(q, \theta)], \qquad W_u(q, \theta) = H^{-1}(q, \theta)G(q, \theta) \tag{7.4}$$

but it could also be arrived at from other considerations.
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The predictor could also be a nonlinear filter. as discussed in Chapter 5. in 
which case we write it as a general function of past data Z'~!: 


MO) : $(t]0) = g(t. Z1: 0) (7.5) 


The model $\mathcal{M}(\theta)$ may also contain (model) assumptions about the character of
the associated prediction errors, such as their variances ($\lambda(\theta)$) or their probability
distribution (PDF $f_e(x, \theta)$).

We are also in the situation that we have collected, or are about to collect, a
batch of data from the system:

$$Z^N = [y(1), u(1), y(2), u(2), \ldots, y(N), u(N)] \tag{7.6}$$

The problem we are faced with is to decide upon how to use the information contained
in $Z^N$ to select a proper value $\hat\theta_N$ of the parameter vector, and hence a proper
member $\mathcal{M}(\hat\theta_N)$ in the set $\mathcal{M}^*$. Formally speaking, we have to determine a mapping
from the data $Z^N$ to the set $D_{\mathcal{M}}$:

$$Z^N \to \hat\theta_N \in D_{\mathcal{M}} \tag{7.7}$$

Such a mapping is a parameter estimation method.


Evaluating the Candidate Models 


We are looking for a test by which the different models’ ability to “describe” the
observed data can be evaluated. We have stressed that the essence of a model is its
prediction aspect, and we shall also judge its performance in this respect. Thus let
the prediction error given by a certain model $\mathcal{M}(\theta_*)$ be given by

$$\varepsilon(t, \theta_*) = y(t) - \hat y(t|\theta_*) \tag{7.8}$$

When the data set $Z^N$ is known, these errors can be computed for $t = 1, 2, \ldots, N$.

A “good” model, we say, is one that is good at predicting, that is, one that produces
small prediction errors when applied to the observed data. Note that there is
considerable flexibility in selecting various predictor functions, and this gives a
corresponding freedom in defining “good” models in terms of prediction performance.
A guiding principle for parameter estimation thus is:

Based on $Z^t$ we can compute the prediction error $\varepsilon(t, \theta)$ using (7.8).
At time $t = N$, select $\hat\theta_N$ so that the prediction errors $\varepsilon(t, \hat\theta_N)$,
$t = 1, 2, \ldots, N$, become as small as possible. (7.9)

The question is how to qualify what “small” should mean. In this chapter we shall
describe two such approaches. One is to form a scalar-valued norm or criterion
function that measures the size of $\varepsilon$. This approach is dealt with in Sections 7.2 to
7.4. Another approach is to demand that $\varepsilon(t, \hat\theta_N)$ be uncorrelated with a given data
sequence. This corresponds to requiring that certain “projections” of $\varepsilon(t, \hat\theta_N)$ are
zero and is further discussed in Sections 7.5 and 7.6.

7.2 MINIMIZING PREDICTION ERRORS


The prediction-error sequence in (7.8) can be seen as a vector in $\mathbf{R}^N$. The “size” of
this vector could be measured using any norm in $\mathbf{R}^N$, quadratic or nonquadratic. This
leaves a substantial amount of choices. We shall restrict the freedom somewhat by
only considering the following way of evaluating “how large” the prediction-error
sequence is: Let the prediction-error sequence be filtered through a stable linear
filter $L(q)$:

$$\varepsilon_F(t,\theta) = L(q)\,\varepsilon(t,\theta), \qquad 1 \le t \le N \tag{7.10}$$

Then use the following norm:

$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N} \ell\left(\varepsilon_F(t,\theta)\right) \tag{7.11}$$

where $\ell(\cdot)$ is a scalar-valued (typically positive) function.

The function $V_N(\theta, Z^N)$ is, for given $Z^N$, a well-defined scalar-valued function
of the model parameter $\theta$. It is a natural measure of the validity of the model $\mathcal{M}(\theta)$.
The estimate $\hat\theta_N$ is then defined by minimization of (7.11):

$$\hat\theta_N = \hat\theta_N(Z^N) = \arg\min_{\theta \in D_{\mathcal{M}}} V_N(\theta, Z^N) \tag{7.12}$$

Here arg min means “the minimizing argument of the function.” If the minimum is
not unique, we let arg min denote the set of minimizing arguments. The mapping
(7.7) is thus defined implicitly by (7.12).
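
As an illustration of (7.10)-(7.12), the following is a minimal sketch, not a recommended implementation. It assumes a hypothetical first-order output-error model (so $H = 1$), the prefilter $L(q) = 1$, the quadratic norm $\ell(\varepsilon) = \varepsilon^2/2$, and a crude grid search in place of the numerical minimization methods of Chapter 10. The names `predict_oe` and `criterion`, the parameter values, and the grid are all assumptions made for the example.

```python
import numpy as np

def predict_oe(theta, u):
    """Predictor of a first-order output-error model:
    yhat(t) = -f*yhat(t-1) + b*u(t-1), with theta = (f, b)."""
    f, b = theta
    yhat = np.zeros(len(u))
    for t in range(1, len(u)):
        yhat[t] = -f * yhat[t - 1] + b * u[t - 1]
    return yhat

def criterion(theta, y, u, ell=lambda e: 0.5 * e**2):
    """V_N(theta, Z^N) = (1/N) * sum of ell(eps(t, theta)), with L(q) = 1."""
    eps = y - predict_oe(theta, u)
    return np.mean(ell(eps))

# Simulated data from a "true" parameter value (assumed for illustration).
rng = np.random.default_rng(0)
N = 300
u = rng.standard_normal(N)
theta_true = (-0.7, 1.0)
y = predict_oe(theta_true, u) + 0.1 * rng.standard_normal(N)

# Crude arg min over a grid of candidate parameter values.
f_grid = np.linspace(-0.95, 0.95, 39)
b_grid = np.linspace(0.0, 2.0, 41)
V = np.array([[criterion((f, b), y, u) for b in b_grid] for f in f_grid])
i, j = np.unravel_index(np.argmin(V), V.shape)
theta_hat = (f_grid[i], b_grid[j])
print("theta_hat =", theta_hat)
```

Passing a different function as `ell` (for example `np.abs`) changes the norm without touching the rest of the sketch, which mirrors how different members of the family (7.12) arise.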
This way of estimating $\theta$ contains many well-known and much used procedures.
We shall use the general term prediction-error identification methods (PEM) for the
family of approaches that corresponds to (7.12). Particular methods, with specific
“names” attached to themselves, are obtained as special cases of (7.12), depending
on the choice of $\ell(\cdot)$, the choice of prefilter $L(\cdot)$, the choice of model structure, and,
in some cases, the choice of method by which the minimization is realized. We shall
give particular attention to two especially well known members in the family (7.12)
in the subsequent two sections. First, however, let us discuss some aspects on the
choices of $L(q)$ and $\ell(\cdot)$ in (7.10) and (7.11). See also Section 15.2.


Choice of L 


The effect of the filter $L$ is to allow extra freedom in dealing with non-momentary
properties of the prediction errors. Clearly, if the predictor is linear and time invariant,
and $y$ and $u$ are scalars, then the result of filtering $\varepsilon$ is the same as first filtering
the input-output data and then applying the predictors.

The effect of $L$ is best understood in a frequency-domain interpretation and a
full discussion will be postponed to Section 14.4. It is clear, however, that by the use
of $L$, effects of high-frequency disturbances, not essential to the modeling problem,
or slow drift terms and the like, can be removed. It also seems reasonable that certain
properties of the models may be enhanced or suppressed by a properly selected $L$.
$L$ thus acts like frequency weighting.

The following particular aspect of the filtering (7.10) should be noted. If a
model (7.3) is used, the filtered error $\varepsilon_F(t,\theta)$ is given by

$$\varepsilon_F(t,\theta) = L(q)\,\varepsilon(t,\theta) = \left[L^{-1}(q)H(q,\theta)\right]^{-1}\left(y(t) - G(q,\theta)u(t)\right) \tag{7.13}$$

The effect of prefiltering is thus identical to changing the noise model from $H(q,\theta)$
to

$$H_L(q,\theta) = L^{-1}(q)H(q,\theta) \tag{7.14}$$

When we describe and analyze methods that employ general noise models in
linear systems, we shall usually confine ourselves to $L(q) \equiv 1$, since the option of
prefiltering is taken care of by the freedom in selecting $H(q,\theta)$. A discussion of the
use and effects of $L(q)$ in practical terms will be given in Section 14.4.


Choice of $\ell$

For the choice of $\ell(\cdot)$, a first candidate would be a quadratic norm:

$$\ell(\varepsilon) = \frac{1}{2}\varepsilon^2 \tag{7.15}$$

and this is indeed a standard choice, which is convenient both for computation and
analysis. Questions of robustness against bad data may, however, warrant other
norms, which we shall discuss in some detail in Section 15.2. One may also conceive
situations where the “best” norm is not known beforehand so that it is reasonable to
parametrize the norm itself:

$$\ell(\varepsilon, \theta) \tag{7.16}$$

Often the parametrization of the norm is independent of the model parametrization:

$$\theta = \begin{bmatrix} \theta' \\ \alpha \end{bmatrix}, \qquad \ell(\varepsilon(t,\theta), \theta) = \ell(\varepsilon(t,\theta'), \alpha) \tag{7.17}$$

An exception to this case is given in Problem 7E.4.


Time-varying Norms 


It may happen that measurements at different time instants are considered to be of
varying reliability. The reason may be that the degree of noise corruption changes
or that certain measurements are less representative for the system's properties. In
such cases we are motivated to let the norm $\ell$ be time varying:

$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N} \ell\left(\varepsilon(t,\theta), \theta, t\right) \tag{7.18}$$

In this way less reliable measurements can be associated with less weight in the
criterion.

We shall frequently work with a criterion where the weighting is made explicitly
by a weighting function $\beta(N, t)$:

$$V_N(\theta, Z^N) = \sum_{t=1}^{N} \beta(N, t)\,\ell\left(\varepsilon(t,\theta), \theta\right) \tag{7.19}$$

For fixed $N$, the $N$-dependence of $\beta(N, t)$ is of course immaterial. However, when
estimates $\hat\theta_N$ for different $N$ are compared, as for example in recursive identification
(see Chapter 11), it becomes interesting to discuss how $\beta(N, t)$ varies with $N$. We
shall return to this issue in Section 11.2.


Frequency-domain Interpretation of Quadratic Prediction-error 
Criteria for Linear Time-invariant Models 


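Let us consider the quadratic criterion (7.12) and (7.15) for the standard linear
model (7.3):

$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N}\frac{1}{2}\varepsilon^2(t,\theta), \qquad \varepsilon(t,\theta) = H^{-1}(q,\theta)\left[y(t) - G(q,\theta)u(t)\right] \tag{7.20}$$

Let $E_N(2\pi k/N, \theta)$, $k = 0, 1, \ldots, N-1$, be the DFT of $\varepsilon(t,\theta)$, $t = 1, 2, \ldots, N$:

$$E_N(2\pi k/N, \theta) = \frac{1}{\sqrt{N}}\sum_{t=1}^{N}\varepsilon(t,\theta)\,e^{-2\pi i k t/N}$$

Then, by Parseval’s relation (2.44),

$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{k=0}^{N-1}\frac{1}{2}\left|E_N(2\pi k/N, \theta)\right|^2 \tag{7.21}$$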


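Now let

$$w(t,\theta) = G(q,\theta)u(t)$$

Then the DFT of $w(t,\theta)$ is, according to Theorem 2.1,

$$W_N(\omega,\theta) = G(e^{i\omega},\theta)U_N(\omega) + R_N(\omega)$$

with

$$|R_N(\omega)| \le \frac{C}{\sqrt{N}}$$

The DFT of $s(t,\theta) = y(t) - w(t,\theta)$ then is

$$S_N(\omega,\theta) = Y_N(\omega) - G(e^{i\omega},\theta)U_N(\omega) - R_N(\omega)$$

Finally,

$$\varepsilon(t,\theta) = H^{-1}(q,\theta)s(t,\theta)$$

has the DFT, again using Theorem 2.1,

$$E_N(\omega,\theta) = H^{-1}(e^{i\omega},\theta)S_N(\omega,\theta) + \tilde R_N(\omega)$$

with

$$|\tilde R_N(\omega)| \le \frac{C}{\sqrt{N}}$$

Inserting this into (7.21) gives

$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{k=0}^{N-1}\frac{1}{2}\left|H(e^{2\pi i k/N},\theta)\right|^{-2}\left|Y_N(2\pi k/N) - G(e^{2\pi i k/N},\theta)U_N(2\pi k/N)\right|^2 + \bar R_N$$

with $|\bar R_N| \le C/\sqrt{N}$, or, using the definition of the ETFE $\hat{\hat G}_N$ in (6.24),

$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{k=0}^{N-1}\frac{1}{2}\left|\hat{\hat G}_N(e^{2\pi i k/N}) - G(e^{2\pi i k/N},\theta)\right|^2 Q_N(2\pi k/N,\theta) + \bar R_N \tag{7.22}$$

with

$$Q_N(\omega,\theta) = \frac{|U_N(\omega)|^2}{\left|H(e^{i\omega},\theta)\right|^2} \tag{7.23}$$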


First notice that, apart from the remainder term $\bar R_N$, the expression (7.22) coincides
with the weighted least-squares criterion for a model:

$$\hat{\hat G}_N(e^{2\pi i k/N}) = G(e^{2\pi i k/N},\theta) + v(k) \tag{7.24}$$

Compare with (II.96) and (II.97). According to Lemma 6.1, the variance of $v(k)$ is,
asymptotically, $\Phi_v(2\pi k/N)/|U_N(2\pi k/N)|^2$, so the weighting coefficient $Q_N(\omega,\theta)$
is the inverse variance, which is optimal for linear regressions, according to (II.65). In
(7.23) the unknown noise spectrum $\Phi_v(\omega)$ is replaced by the model noise spectrum
$|H(e^{i\omega},\theta)|^2$. Consequently, the prediction-error methods can be seen as methods of
fitting the ETFE to the model transfer function with a weighted norm, corresponding
to the model signal-to-noise ratio at the frequency in question. For notational
reasons, it is instructive to rewrite the sum (7.22) approximately as an integral:

$$V_N(\theta, Z^N) \approx \frac{1}{2\pi}\int_{-\pi}^{\pi}\frac{1}{2}\left|\hat{\hat G}_N(e^{i\omega}) - G(e^{i\omega},\theta)\right|^2 Q_N(\omega,\theta)\,d\omega \tag{7.25}$$

The shift of integration interval from $(0, 2\pi)$ to $(-\pi, \pi)$ is possible since the
integrand is periodic.

With this interpretation we have described the prediction-error estimate as an
alternative way of smoothing the ETFE, showing a strong conceptual relationship
to the spectral analysis methods of Section 6.4. See Problem 7G.2 for a direct tie.
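
The equality between the time-domain criterion (7.20) and the ETFE-fitting form (7.22) can be checked numerically. The sketch below is a simplified illustration under stated assumptions: the noise model is $H = 1$, the model $G$ is a short FIR filter, and the input is treated as periodic with period $N$, so that the remainder term vanishes and the two expressions agree exactly. The impulse responses and the names `G_freq`, `simulate`, `V_time`, `V_freq` are hypothetical. NumPy's FFT sums over $t = 0, \ldots, N-1$ rather than $t = 1, \ldots, N$; this only changes a phase factor and not the magnitudes used here.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256
u = rng.standard_normal(N)
g_true = np.array([0.0, 1.0, 0.5])      # hypothetical "true" impulse response

def G_freq(g):
    """G(e^{2*pi*i*k/N}) = sum_j g_j * exp(-2*pi*i*k*j/N), k = 0..N-1."""
    return np.fft.fft(g, N)

def simulate(g, u):
    """Output of the FIR model when the input is continued periodically."""
    return np.real(np.fft.ifft(G_freq(g) * np.fft.fft(u)))

y = simulate(g_true, u) + 0.1 * rng.standard_normal(N)

U = np.fft.fft(u) / np.sqrt(N)           # U_N with the 1/sqrt(N) normalization
Y = np.fft.fft(y) / np.sqrt(N)           # Y_N
etfe = Y / U                             # empirical transfer function estimate
Q = np.abs(U) ** 2                       # weighting (7.23) with H = 1

def V_time(g):
    """Time-domain quadratic criterion (7.20) with H = 1."""
    return np.mean(0.5 * (y - simulate(g, u)) ** 2)

def V_freq(g):
    """Frequency-domain form (7.22) without the remainder term."""
    return np.mean(0.5 * np.abs(etfe - G_freq(g)) ** 2 * Q)

g_test = np.array([0.1, 0.8, 0.3])       # an arbitrary candidate model
print(V_time(g_test), V_freq(g_test))
```

For non-periodic data the two values would differ by a remainder of the size indicated in the text.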

When we specialize to the case of a time series [no input and $G(q,\theta) \equiv 0$], the
criterion (7.25) takes the form

$$V_N(\theta, Z^N) = \frac{1}{2\pi}\int_{-\pi}^{\pi}\frac{1}{2}\,\frac{|Y_N(\omega)|^2}{\left|H(e^{i\omega},\theta)\right|^2}\,d\omega \tag{7.26}$$

Such parametric estimators of spectra are known as “Whittle-type estimators,” after
Whittle (1951).

In Section 7.8 we shall return to frequency domain criteria. There, however,
we take another viewpoint and assume that the observed data are in the frequency
domain, being Fourier transforms of the input and output time domain signals.


Multivariable Systems (*)

For multioutput systems, the counterpart of the quadratic criterion is

$$\ell(\varepsilon) = \frac{1}{2}\varepsilon^T\Lambda^{-1}\varepsilon \tag{7.27}$$

for some symmetric, positive semidefinite $p \times p$ matrix $\Lambda$ that weights together the
relative importance of the components of $\varepsilon$.

One might discuss what is the best choice of norm $\Lambda$. We shall do that in some
detail in Section 15.2. Here we only remark that, just as in (7.16), the parameter
vector $\theta$ could be extended to include components of $\Lambda$, and the function $\ell$ will then
be an appropriate function of $\theta$.

As a variant of the criterion (7.11), where a scalar $\ell(\varepsilon)$ is formed for each $t$,
we could first form the $p \times p$ matrix

$$Q_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N}\varepsilon(t,\theta)\,\varepsilon^T(t,\theta) \tag{7.28}$$

and let the criterion be a scalar-valued function of this matrix:

$$V_N(\theta, Z^N) = h\left(Q_N(\theta, Z^N)\right) \tag{7.29}$$

The criterion (7.27) is then obtained by

$$h(Q) = \frac{1}{2}\operatorname{tr}\left(Q\Lambda^{-1}\right) \tag{7.30}$$


7.3 LINEAR REGRESSIONS AND THE LEAST-SQUARES METHOD


Linear Regressions 


We found in both Sections 4.2 and 5.2 that linear regression model structures are
very useful in describing basic linear and nonlinear systems. The linear regression
employs a predictor (5.67)

$$\hat y(t|\theta) = \varphi^T(t)\theta + \mu(t) \tag{7.31}$$

that is linear in $\theta$. Here $\varphi$ is the vector of regressors, the regression vector. Recall
that for the ARX structure (4.7) we have

$$\varphi(t) = \left[-y(t-1)\;\; -y(t-2)\;\ldots\; -y(t-n_a)\;\; u(t-1)\;\ldots\; u(t-n_b)\right]^T \tag{7.32}$$

In (7.31), $\mu(t)$ is a known data-dependent vector. For notational simplicity we shall
take $\mu(t) \equiv 0$ in the remainder of this section; it is quite straightforward to include
it. See Problem 7D.1.

Linear regression forms a standard topic in statistics. The reader could consult
Appendix II for a refresher of basic properties. The present section can, however,
be read independently of Appendix II.


Least-squares Criterion 
With (7.31) the prediction error becomes

$$\varepsilon(t,\theta) = y(t) - \varphi^T(t)\theta$$

and the criterion function resulting from (7.10) and (7.11), with $L(q) = 1$ and
$\ell(\varepsilon) = \frac{1}{2}\varepsilon^2$, is

$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N}\frac{1}{2}\left[y(t) - \varphi^T(t)\theta\right]^2 \tag{7.33}$$

This is the least-squares criterion for the linear regression (7.31). The unique feature
of this criterion, developed from the linear parametrization and the quadratic criterion,
is that it is a quadratic function in $\theta$. Therefore, it can be minimized analytically,
which gives, provided the indicated inverse exists,

$$\hat\theta_N^{LS} = \arg\min V_N(\theta, Z^N) = \left[\frac{1}{N}\sum_{t=1}^{N}\varphi(t)\varphi^T(t)\right]^{-1}\frac{1}{N}\sum_{t=1}^{N}\varphi(t)y(t) \tag{7.34}$$

the least-squares estimate (LSE) (see Problem 7D.2).

Introduce the $d \times d$ matrix

$$R(N) = \frac{1}{N}\sum_{t=1}^{N}\varphi(t)\varphi^T(t) \tag{7.35}$$

and the $d$-dimensional column vector

$$f(N) = \frac{1}{N}\sum_{t=1}^{N}\varphi(t)y(t) \tag{7.36}$$

In the case (7.32), $\varphi(t)$ contains lagged input and output variables, and the entries
of the quantities (7.35) and (7.36) will be of the form

$$[R(N)]_{ij} = \frac{1}{N}\sum_{t=1}^{N}y(t-i)\,y(t-j), \qquad 1 \le i, j \le n_a$$

and similar sums of $u(t-r)\cdot u(t-s)$ or $u(t-r)\cdot y(t-s)$ for the other entries of
$R(N)$. That is, they will consist of estimates of the covariance functions of $\{y(t)\}$ and
$\{u(t)\}$. The LSE can thus be computed using only such estimates and is therefore
related to correlation analysis, as described in Section 6.1.
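
A minimal numerical sketch of (7.34)-(7.36) for a first-order ARX model follows. The system coefficients, noise level, and variable names are assumptions chosen for illustration. The estimate is computed both from the normal equations $R(N)\theta = f(N)$ and with `numpy.linalg.lstsq`, which solves the same least-squares problem via an orthogonal factorization and is usually preferable numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 2000
a1, b1 = -0.8, 0.5        # hypothetical system: y(t) + a1*y(t-1) = b1*u(t-1) + e(t)
u = rng.standard_normal(N)
e = 0.1 * rng.standard_normal(N)
y = np.zeros(N)
for t in range(1, N):
    y[t] = -a1 * y[t - 1] + b1 * u[t - 1] + e[t]

# Regression vectors phi(t) = [-y(t-1), u(t-1)]^T stacked as rows, cf. (7.32).
Phi = np.column_stack([-y[:-1], u[:-1]])
Y = y[1:]

# Normal equations built from R(N) in (7.35) and f(N) in (7.36).
R = Phi.T @ Phi / len(Y)
f = Phi.T @ Y / len(Y)
theta_ls = np.linalg.solve(R, f)

# Same estimate from a numerically more robust least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(Phi, Y, rcond=None)

print("theta_ls =", theta_ls)   # estimates of [a1, b1]
```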


Properties of the LSE 


The least-squares method is a special case of the prediction-error identification
method (7.12). An analysis of its properties is therefore contained in the general
treatment in Chapters 8 and 9. It is, however, useful to include a heuristic investigation
of the LSE at this point.

Suppose that the observed data actually have been generated by

$$y(t) = \varphi^T(t)\theta_0 + v_0(t) \tag{7.37}$$

for some sequence $\{v_0(t)\}$. We may think of $\theta_0$ as a “true value” of the parameter
vector.

As in (1.14)-(1.15) we find that

$$\lim_{N\to\infty}\hat\theta_N^{LS} - \theta_0 = \lim_{N\to\infty}R^{-1}(N)\,\frac{1}{N}\sum_{t=1}^{N}\varphi(t)v_0(t) = (R^*)^{-1}f^*,$$

$$R^* = \bar E\varphi(t)\varphi^T(t), \qquad f^* = \bar E\varphi(t)v_0(t) \tag{7.38}$$

provided $v_0$ and $\varphi$ are quasi-stationary, so that Theorem 2.3 can be applied. For the
LSE to be consistent, that is, for $\hat\theta_N^{LS}$ to converge to $\theta_0$, we thus have to require:

i. $R^*$ is non-singular. This will be secured by the input properties, as in (1.17)-
(1.18), and discussed in much more detail in Chapter 13.

ii. $f^* = 0$. This will be the case if either:

(a) $\{v_0(t)\}$ is a sequence of independent random variables with zero mean
values (white noise). Then $v_0(t)$ will not depend on what happened up
to time $t-1$ and hence $\bar E\varphi(t)v_0(t) = 0$.

(b) The input sequence $\{u(t)\}$ is independent of the zero mean sequence
$\{v_0(t)\}$ and $n_a = 0$ in (7.32). Then $\varphi(t)$ contains only $u$-terms and hence
$\bar E\varphi(t)v_0(t) = 0$.

When $n_a > 0$ so that $\varphi(t)$ contains $y(k)$, $t - n_a \le k \le t-1$, and $v_0(t)$ is not white
noise, then (usually) $\bar E\varphi(t)v_0(t) \ne 0$. This follows since $\varphi(t)$ contains $y(t-1)$,
while $y(t-1)$ contains the term $v_0(t-1)$ that is correlated with $v_0(t)$. Therefore,
we may expect consistency only in cases (a) and (b).

Weighted Least Squares


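Just as in (7.18) and (7.19), the different measurements could be assigned different
weights in the least-squares criterion:

$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N}\alpha_t\left[y(t) - \varphi^T(t)\theta\right]^2 \tag{7.39}$$

or

$$V_N(\theta, Z^N) = \sum_{t=1}^{N}\beta(N,t)\left[y(t) - \varphi^T(t)\theta\right]^2 \tag{7.40}$$

The expression for the resulting estimate is quite analogous to (7.34):

$$\hat\theta_N^{LS} = \left[\sum_{t=1}^{N}\beta(N,t)\varphi(t)\varphi^T(t)\right]^{-1}\sum_{t=1}^{N}\beta(N,t)\varphi(t)y(t) \tag{7.41}$$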

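Multivariable Case (*)

If the output $y(t)$ is a $p$-vector and the norm (7.27) is used, the LS criterion takes
the form

$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N}\frac{1}{2}\left[y(t) - \varphi^T(t)\theta\right]^T\Lambda^{-1}\left[y(t) - \varphi^T(t)\theta\right] \tag{7.42}$$

This gives the estimate

$$\hat\theta_N^{LS} = \left[\frac{1}{N}\sum_{t=1}^{N}\varphi(t)\Lambda^{-1}\varphi^T(t)\right]^{-1}\frac{1}{N}\sum_{t=1}^{N}\varphi(t)\Lambda^{-1}y(t) \tag{7.43}$$

In case we use the particular parametrization (4.56) with $\theta$ as an $r \times p$ matrix,

$$\hat y(t|\theta) = \theta^T\varphi(t) \tag{7.44}$$

the LS criterion becomes

$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N}\frac{1}{2}\left\|y(t) - \theta^T\varphi(t)\right\|^2_{\Lambda^{-1}} \tag{7.45}$$

with the estimate

$$\hat\theta_N^{LS} = \left[\frac{1}{N}\sum_{t=1}^{N}\varphi(t)\varphi^T(t)\right]^{-1}\frac{1}{N}\sum_{t=1}^{N}\varphi(t)y^T(t) \tag{7.46}$$

(see Problem 7D.2). The expression (7.46) brings out the advantages of the structure
(7.44): To determine the $r \times p$ estimate $\hat\theta_N$, it is sufficient to invert an $r \times r$ matrix.
In (7.43) $\theta$ is a $p \cdot r$ vector and the matrix inversion involves a $pr \times pr$ matrix.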


Colored Equation-error Noise (*)

The LS method has many advantages, the most important one being that the global
minimum of (7.33) can be found efficiently and unambiguously (no local minima
other than global ones exist). Its main shortcoming relates to the asymptotic properties
quoted previously: If, in a difference equation,

$$y(t) + a_1y(t-1) + \cdots + a_{n_a}y(t-n_a) = b_1u(t-1) + \cdots + b_{n_b}u(t-n_b) + v(t) \tag{7.47}$$

the equation error $v(t)$ is not white noise, then the LSE will not converge to the true
values of $a_i$ and $b_i$. To deal with this problem, we may incorporate further modeling
of the equation error $v(t)$ as discussed in Section 4.2, let us say

$$v(t) = \kappa(q)e(t) \tag{7.48}$$

with $e$ white and $\kappa$ a linear filter. Models employing (7.48) will typically take us out
from the LS environment, except in two cases, which we now discuss.

Known noise properties: If in (7.47) and (7.48) $a_i$ and $b_i$ are unknown, but $\kappa$ is a
known filter (not too realistic a situation), we have

$$A(q)y(t) = B(q)u(t) + \kappa(q)e(t) \tag{7.49}$$

Filtering (7.49) through the filter $\kappa^{-1}(q)$ gives

$$A(q)y_F(t) = B(q)u_F(t) + e(t) \tag{7.50}$$

where

$$y_F(t) = \kappa^{-1}(q)y(t), \qquad u_F(t) = \kappa^{-1}(q)u(t) \tag{7.51}$$

Since $e$ is white, the LS method can be applied to (7.50) without problems. Notice
that this is equivalent to applying the filter $L(q) = \kappa^{-1}(q)$ in (7.10).

High-order models: Suppose that the noise $v$ can be well described by $\kappa(q) = 1/D(q)$
in (7.48), where $D(q)$ is a polynomial of degree $r$. [That is, $v(t)$ is supposed
to be an autoregressive (AR) process of order $r$.] This gives

$$A(q)y(t) = B(q)u(t) + \frac{1}{D(q)}e(t) \tag{7.52}$$

or

$$A(q)D(q)y(t) = B(q)D(q)u(t) + e(t) \tag{7.53}$$

Applying the LS method to (7.53) with orders $n_A = n_a + r$ and $n_B = n_b + r$ gives,
since $e$ is white, consistent estimates of $AD$ and $BD$. Hence the transfer function
from $u$ to $y$,

$$\frac{B(q)D(q)}{A(q)D(q)} = \frac{B(q)}{A(q)}$$

is correctly estimated. This approach was called repeated least squares in Åström
and Eykhoff (1971). See also Söderström (1975b) and Stoica (1976).
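
The high-order approach in (7.52)-(7.53) can be illustrated with a small simulation. This is a sketch under assumed values: a first-order system ($n_a = n_b = 1$) with AR(1) equation-error noise ($r = 1$). The helper `arx_ls` and all numerical values are hypothetical. A first-order LS fit is expected to be biased, whereas a second-order fit should approach the coefficients of $A(q)D(q)$ and $B(q)D(q)$.

```python
import numpy as np

def arx_ls(y, u, na, nb):
    """LS estimate for an ARX model; returns [a_1..a_na, b_1..b_nb]."""
    n = max(na, nb)
    cols = [-y[n - i:len(y) - i] for i in range(1, na + 1)]   # -y(t-i)
    cols += [u[n - i:len(u) - i] for i in range(1, nb + 1)]   # u(t-i)
    Phi = np.column_stack(cols)
    theta, *_ = np.linalg.lstsq(Phi, y[n:], rcond=None)
    return theta

rng = np.random.default_rng(4)
N = 20000
a, b, d = -0.8, 0.5, -0.5     # A(q) = 1 + a q^-1, B(q) = b q^-1, D(q) = 1 + d q^-1
u = rng.standard_normal(N)
e = 0.5 * rng.standard_normal(N)
v = np.zeros(N)
y = np.zeros(N)
for t in range(1, N):
    v[t] = -d * v[t - 1] + e[t]                  # D(q) v(t) = e(t): colored noise
    y[t] = -a * y[t - 1] + b * u[t - 1] + v[t]   # A(q) y(t) = B(q) u(t) + v(t)

theta1 = arx_ls(y, u, 1, 1)   # orders of the true system: biased by colored noise
theta2 = arx_ls(y, u, 2, 2)   # orders n_a + r, n_b + r: estimates AD and BD

# Coefficients of A(q)D(q) and B(q)D(q) for comparison.
expected = np.array([a + d, a * d, b, b * d])
print("first-order fit: ", theta1)
print("second-order fit:", theta2, "expected:", expected)
```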


Estimating State Space Models Using Least Squares Techniques 
(Subspace Methods) 


A linear system can always be represented in state space form as in (4.84):

$$\begin{aligned} x(t+1) &= Ax(t) + Bu(t) + w(t) \\ y(t) &= Cx(t) + Du(t) + v(t) \end{aligned} \tag{7.54}$$

with white noises $w$ and $v$. Alternatively we could just represent the input-output
dynamics as in (4.80):

$$\begin{aligned} x(t+1) &= Ax(t) + Bu(t) \\ y(t) &= Cx(t) + Du(t) + v(t) \end{aligned} \tag{7.55}$$

where the noise at the output, $v$, very well could be colored. It should be noted that
the input-output dynamics could be represented with a lower order model in (7.55)
than in (7.54) since describing the noise character might require some extra states.

To estimate such a model, the matrices can be parameterized in ways that
are described in Section 4.3 or Appendix 4A, either from physical grounds or as
black boxes in canonical forms. Then these parameters can be estimated using the
techniques dealt with in Section 7.4.

However, there are also other possibilities: We assume that we have no insight
into the particular structure, and we would just estimate any matrices $A$, $B$, $C$, and $D$
that give a good description of the input-output behavior of the system. Since there
are an infinite number of such matrices that describe the same system (the similarity
transforms), we will have to fix the coordinate basis of the state-space realization.

Let us for a moment assume that not only are $u$ and $y$ measured, but also the
sequence of state vectors $x$. This would, by the way, fix the state-space realization
coordinate basis. Now, with known $u$, $y$ and $x$, the model (7.54) becomes a linear
regression: the unknown parameters, all of the matrix entries in all the matrices, mix
with measured signals in linear combinations. To see this clearly, let

$$Y(t) = \begin{bmatrix} x(t+1) \\ y(t) \end{bmatrix}, \quad \Theta = \begin{bmatrix} A & B \\ C & D \end{bmatrix}, \quad \Phi(t) = \begin{bmatrix} x(t) \\ u(t) \end{bmatrix}, \quad E(t) = \begin{bmatrix} w(t) \\ v(t) \end{bmatrix}$$

Then, (7.54) can be rewritten as

$$Y(t) = \Theta\Phi(t) + E(t) \tag{7.56}$$

From this, all the matrix elements in $\Theta$ can be estimated by the simple least squares
method (which in the case of Gaussian noise and known covariance matrix coincides
with the maximum likelihood method), as described above in (7.44)-(7.46). The
covariance matrix for $E(t)$ can also be estimated easily as the sample sum of the
squared model residuals. That will give the covariance matrices as well as the cross
covariance matrix for $w$ and $v$. These matrices will, among other things, allow us
to compute the Kalman filter for (7.54). Note that all of the above holds without
changes for multivariable systems, i.e., when the output and input signals are vectors.

The problem is how to obtain the state vector sequence $x$. Some basic realization
theory was reviewed in Appendix 4A, from which the essential results can be
quoted as follows:

Let a system be given by the impulse response representation

$$y(t) = \sum_{j=0}^{\infty}\left[h_u(j)u(t-j) + h_e(j)e(t-j)\right] \tag{7.57}$$

where $u$ is the input and $e$ the innovations. Let the formal $k$-step ahead predictors
be defined by just deleting the contributions to $y(t)$ from $e(j), u(j)$; $j = t, \ldots, t-k+1$:

$$\hat y(t|t-k) = \sum_{j=k}^{\infty}\left[h_u(j)u(t-j) + h_e(j)e(t-j)\right] \tag{7.58}$$

No attempt is thus made to predict the inputs $u(j)$; $j = t, \ldots, t-k+1$ from past
data. Define

$$\hat Y_r(t) = \begin{bmatrix} \hat y(t|t-1) \\ \vdots \\ \hat y(t+r-1|t-1) \end{bmatrix} \tag{7.59a}$$

$$\mathbf{Y} = \begin{bmatrix} \hat Y_r(1) & \ldots & \hat Y_r(N) \end{bmatrix} \tag{7.59b}$$

Then the following is true as $N \to \infty$ (see Lemmas 4A.1 and 4A.2 and their proofs):

1. The system (7.57) has an $n$th order minimal state space description if and only
if the rank of $\mathbf{Y}$ is equal to $n$ for all $r \ge n$.

2. The state vector of any minimal realization in innovations form can be chosen
as linear combinations of $\hat Y_r$ that form a row basis for $\mathbf{Y}$, i.e.,

$$x(t) = L\hat Y_r(t) \tag{7.60}$$

where the $n \times pr$ matrix $L$ is such that $L\mathbf{Y}$ spans $\mathbf{Y}$. ($p$ is the dimension of
the output vector $y(t)$.)

Note that the canonical state space representations described in Appendix 4A
correspond to $L$ matrices that just pick out certain rows of $\hat Y_r$. In general, we are not
confined to such choices, but may pick $L$ so that $x(t)$ becomes a well-conditioned
basis.


It is clear that the facts above will allow us to find a suitable state vector from
data. The only remaining problem is to estimate the $k$-step ahead predictors. The
true predictor $\hat y(t+k-1|t-1)$ is given by (7.58). The innovation $e(j)$ can be
written as a linear combination of past input-output data. The predictor can thus
be expressed as a linear function of $u(i)$, $y(i)$, $i \le t-1$. For practical reasons the
predictor is approximated so that it only depends on a fixed and finite amount of past
data, like the $s_1$ past outputs and the $s_2$ past inputs. This means that it takes the form

$$\hat y(t+k-1|t-1) = \alpha_1y(t-1) + \ldots + \alpha_{s_1}y(t-s_1) + \beta_1u(t-1) + \ldots + \beta_{s_2}u(t-s_2) \tag{7.61}$$

This predictor can then efficiently be determined by another linear least squares
projection directly on the input-output data. That is, set up the model

$$y(t+k-1) = \theta_k^T\varphi_s(t) + \gamma_k^TU_\ell(t) + \varepsilon(t+k-1) \tag{7.62}$$

or, dealing with all $r$ predictors simultaneously,

$$Y_r(t) = \Theta\varphi_s(t) + \Gamma U_\ell(t) + E(t) \tag{7.63}$$

Here:

$$\varphi_s(t) = \left[y(t-1)\;\ldots\; y(t-s_1)\;\; u(t-1)\;\ldots\; u(t-s_2)\right]^T \tag{7.64a}$$

$$U_\ell(t) = \left[u(t)\;\ldots\; u(t+\ell-1)\right]^T \tag{7.64b}$$

$$Y_r(t) = \left[y(t)\;\ldots\; y(t+r-1)\right]^T \tag{7.64c}$$

$$\Theta = \left[\theta_1\;\ldots\;\theta_r\right]^T, \qquad \Gamma = \left[\gamma_1\;\ldots\;\gamma_r\right]^T \tag{7.64d}$$

$$E(t) = \left[\varepsilon(t)\;\ldots\;\varepsilon(t+r-1)\right]^T \tag{7.64e}$$

Moreover, $\ell$ is the number, typically equal to $r$, of input values whose influence on
$Y_r(t)$ is to be accounted for. Now, $\Theta$ and $\Gamma$ in (7.63) can be estimated using least
squares, giving $\hat\Theta_N$ and $\hat\Gamma_N$. The $k$-step ahead predictors are then given by

$$\hat Y_r(t) = \hat\Theta_N\varphi_s(t) \tag{7.65}$$

For large enough $s$, this will give a good approximation of the true predictors.

Remark 1: The reason for the term $U_\ell$ is as follows: The values of
$u(t+1), \ldots, u(t+k)$ affect $y(t+k-1)$. If these values can be predicted from
past measurements (which is the case if $u$ is not white noise), then the predictions
of $y(t+k-1)$ based on past data will account also for the influence of $U_\ell$. If we
estimate (7.61) directly, this influence will thus be included. However, as demanded
by (7.58), the influence of $U_\ell$ should be ignored in the “formal” $k$-step ahead
predictor we are seeking. This is the reason why this influence is explicitly estimated in
(7.62) and then thrown away in the predictor (7.65).

Remark 2: If we seek a state-space realization like (7.55) that does not model
the noise properties (an output error model), we would just ignore the terms $e(t-j)$
in (7.57)-(7.58). This implies that the predictor in (7.62) would be based on past
inputs only, i.e., $s_1 = 0$ in (7.64).


The method thus consists of the following steps:

Basic Subspace Algorithm (7.66)

1. Choose $s_1$, $s_2$, $r$ and $\ell$ and form $\hat Y_r(t)$ in (7.65) and $\mathbf{Y}$ as in (7.59).

2. Estimate the rank $n$ of $\mathbf{Y}$ and determine $L$ in (7.60) so that $x(t)$ corresponds
to a well-conditioned basis for it.

3. Estimate $A$, $B$, $C$, $D$ and the noise covariance matrices by applying the LS
method to the linear regression (7.56).

What we have described now is the subspace projection approach to estimating the
matrices of the state-space model (7.54), including the basis for the representation
and the noise covariance matrices. There are a number of variants of this approach.
See among several references, e.g., Van Overschee and DeMoor (1996), Larimore
(1983), and Verhaegen (1994).
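
The three steps of (7.66) can be sketched in a few lines. The sketch below is a deliberately simplified illustration, not one of the published variants cited above. It assumes a hypothetical second-order SISO system in innovations form, $s_1 = s_2 = s$, $\ell = r$, a model order $n$ taken as known rather than estimated from the singular values, and $L$ chosen as the first $n$ left singular vectors of $\mathbf{Y}$. All numerical values and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical system: x(t+1) = A x + B u + K e,  y = C x + e  (so D = 0).
A = np.array([[0.8, 0.2], [0.0, 0.5]])
B = np.array([1.0, 0.5])
C = np.array([1.0, 0.0])
K = np.array([0.9, 0.45])

N, s, r, n = 4000, 6, 4, 2        # s1 = s2 = s, ell = r, order n assumed known
u = rng.standard_normal(N)
e = 0.05 * rng.standard_normal(N)
x = np.zeros(2)
y = np.zeros(N)
for t in range(N):
    y[t] = C @ x + e[t]
    x = A @ x + B * u[t] + K * e[t]

# Time points with s past samples and r future samples available.
ts = np.arange(s, N - r)
Phi = np.array([np.concatenate([y[t - s:t][::-1], u[t - s:t][::-1]]) for t in ts])  # (7.64a)
Ufut = np.array([u[t:t + r] for t in ts])                                           # (7.64b)
Yfut = np.array([y[t:t + r] for t in ts])                                           # (7.64c)

# Step 1: LS fit of (7.63); keep Theta, discard Gamma, form Yhat_r(t) as in (7.65).
coef, *_ = np.linalg.lstsq(np.hstack([Phi, Ufut]), Yfut, rcond=None)
Yhat = Phi @ coef[:2 * s]                    # row k holds Yhat_r(t_k)^T

# Step 2: SVD of the matrix of predictors; x(t) = L Yhat_r(t) as in (7.60).
Uy, sv, _ = np.linalg.svd(Yhat.T, full_matrices=False)
L = Uy[:, :n].T
X = Yhat @ L.T                               # row k holds x(t_k)^T

# Step 3: LS on the regression (7.56): [x(t+1); y(t)] = Theta [x(t); u(t)] + E(t).
lhs = np.hstack([X[1:], y[ts[:-1]][:, None]])
rhs = np.hstack([X[:-1], u[ts[:-1]][:, None]])
P, *_ = np.linalg.lstsq(rhs, lhs, rcond=None)
Theta_ss = P.T
A_hat, B_hat = Theta_ss[:n, :n], Theta_ss[:n, n]
C_hat, D_hat = Theta_ss[n, :n], Theta_ss[n, n]
resid = lhs - rhs @ P
noise_cov = resid.T @ resid / len(resid)     # sample covariance of E(t)

eig_hat = np.sort(np.linalg.eigvals(A_hat).real)
print("singular values:", sv)
print("eigenvalues of A_hat:", eig_hat)
```

The estimated matrices are expressed in a different state basis than the simulated system, so only basis-independent quantities, such as the eigenvalues of `A_hat`, should be compared with the true ones. In practice the gap in the printed singular values would be used to pick $n$.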

The approach gives very useful algorithms for model estimation, and is particularly
well suited for multivariable systems. The algorithms also allow numerically
very reliable implementations, and typically produce estimated models with good
quality. If desired, the quality may be improved by using the model as an initial
estimate for the prediction error method (7.12). Then the model first needs to be
transformed to a suitable parameterization.

The algorithms contain a number of choices and options, like how to choose
$\ell$, $s_i$ and $r$, and also how to carry out step number 3. There are also several “tricks” to
do step 3 so as to achieve consistent estimates even for finite values of $s_i$. Accordingly,
several variants of this method exist. In Section 10.5 we shall give more algorithmic
details around this approach.

7.4 A STATISTICAL FRAMEWORK FOR PARAMETER ESTIMATION AND
THE MAXIMUM LIKELIHOOD METHOD


So far we have not appealed to any statistical arguments for the estimation of $\theta$. In
fact, our framework of fitting models to data makes sense regardless of a stochastic
setting of the data. It is, however, useful and instructive at this point to briefly describe
basic aspects of statistical parameter estimation and relate them to our framework.

Estimators and the Principle of Maximum Likelihood 


The area of statistical inference, as well as that of system identification and parameter
estimation, deals with the problem of extracting information from observations that
themselves could be unreliable. The observations are then described as realizations
of stochastic variables. Suppose that the observations are represented by the random
variable $y^N = (y(1), y(2), \ldots, y(N))$ that takes values in $\mathbf{R}^N$. The probability
density function (PDF) of $y^N$ is supposed to be

$$f(\theta; x_1, x_2, \ldots, x_N) = f_y(\theta; x^N) \tag{7.67}$$

That is,

$$P(y^N \in A) = \int_{x^N \in A} f_y(\theta; x^N)\,dx^N \tag{7.68}$$

In (7.67), $\theta$ is a $d$-dimensional parameter vector that describes properties of the
observed variable. These are supposed to be unknown, and the purpose of the
observation is in fact to estimate the vector $\theta$ using $y^N$. This is accomplished by an
estimator,

$$\hat\theta(y^N) \tag{7.69}$$

which is a function from $\mathbf{R}^N$ to $\mathbf{R}^d$. If the observed value of $y^N$ is $y_*^N$, then
consequently the resulting estimate is $\hat\theta_* = \hat\theta(y_*^N)$.

Many such estimator functions are possible. A particular one that maximizes
the probability of the observed event is the celebrated maximum likelihood estimator,
introduced by Fisher (1912). It can be defined as follows: The joint probability
density function for the random vector to be observed is given by (7.67). The
probability that the realization (= observation) indeed should take the value $y_*^N$ is thus
proportional to

$$f_y(\theta; y_*^N)$$

This is a deterministic function of $\theta$ once the numerical value $y_*^N$ is inserted. This
function is called the likelihood function. It reflects the “likelihood” that the
observed event should indeed take place. A reasonable estimator of $\theta$ could then be to
select it so that the observed event becomes “as likely as possible.” That is, we seek

$$\hat\theta_{ML}(y_*^N) = \arg\max_{\theta} f_y(\theta; y_*^N) \tag{7.70}$$

where the maximization is performed for fixed $y_*^N$. This function is known as the
maximum likelihood estimator (MLE).


An Example 


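Let $y(i)$, $i = 1, \ldots, N$, be independent random variables with normal distribution
with (unknown) means $\theta_0$ (independent of $i$) and (known) variances $\lambda_i$:

$$y(i) \in N(\theta_0, \lambda_i) \tag{7.71}$$

A common estimator of $\theta_0$ is the sample mean:

$$\hat\theta_{SM}(y^N) = \frac{1}{N}\sum_{i=1}^{N}y(i) \tag{7.72}$$

To calculate the MLE, we start by determining the joint PDF (7.67) for the observations.
Since the PDF for $y(i)$ is

$$\frac{1}{\sqrt{2\pi\lambda_i}}\exp\left[-\frac{(x_i - \theta)^2}{2\lambda_i}\right]$$

and the $y(i)$ are independent, we have

$$f_y(\theta; x^N) = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi\lambda_i}}\exp\left[-\frac{(x_i - \theta)^2}{2\lambda_i}\right] \tag{7.73}$$

The likelihood function is thus given by $f_y(\theta; y^N)$. Maximizing the likelihood function
is the same as maximizing its logarithm. Thus

$$\hat\theta_{ML}(y^N) = \arg\max_{\theta}\log f_y(\theta; y^N) = \arg\max_{\theta}\left\{-\frac{N}{2}\log 2\pi - \sum_{i=1}^{N}\frac{1}{2}\log\lambda_i - \frac{1}{2}\sum_{i=1}^{N}\frac{(y(i) - \theta)^2}{\lambda_i}\right\} \tag{7.74}$$

from which we find

$$\hat\theta_{ML}(y^N) = \frac{1}{\sum_{i=1}^{N}(1/\lambda_i)}\sum_{i=1}^{N}\frac{y(i)}{\lambda_i} \tag{7.75}$$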
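
The two estimators of this example, (7.72) and (7.75), are easy to compare numerically. The following is a small sketch with assumed variances and an assumed true mean; the function names are hypothetical. It also evaluates the log-likelihood in (7.74), so that one can verify that the weighted mean does at least as well as nearby values and as the sample mean.

```python
import numpy as np

def theta_sm(y):
    """Sample mean (7.72)."""
    return np.mean(y)

def theta_ml(y, lam):
    """Maximum likelihood estimate (7.75): inverse-variance weighted mean."""
    return np.sum(y / lam) / np.sum(1.0 / lam)

def log_likelihood(theta, y, lam):
    """Logarithm of the likelihood function, cf. (7.73)-(7.74)."""
    return np.sum(-0.5 * np.log(2 * np.pi * lam) - 0.5 * (y - theta) ** 2 / lam)

rng = np.random.default_rng(5)
theta0 = 3.0                                   # assumed true mean
lam = np.array([0.1, 0.1, 4.0, 4.0, 9.0])      # assumed known variances
y = theta0 + np.sqrt(lam) * rng.standard_normal(lam.size)

t_sm = theta_sm(y)
t_ml = theta_ml(y, lam)
print("sample mean:", t_sm, " MLE:", t_ml)
print("log-likelihoods:", log_likelihood(t_sm, y, lam), log_likelihood(t_ml, y, lam))
```

When all variances are equal, the two estimators coincide.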


Relationship to the Maximum A Posteriori (MAP) Estimate 


The Bavesian approach gives a related but conceptually different treatment of the 
parameter estimation problem. In the Bayesian approach the parameter itself is 
thought of as a random variable. Based on observations of other random variables 
that are correlated with the parameter. we may infer information about its value. 
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Suppose that the properties of the observations can be described in terms of a pa- 
rameter vector 0. With a Bayesian view we thus consider @ to be a random vector 
with a certain prior distribution (“prior” means before the observations have been 
made). The observations y™ are obviously correlated with this 6. After the obser- 
vations have been obtained. we then ask for the posterior PDF for @. From this 
posterior PDF. different estimates of 6 can be determined. for example. the value 
for which the PDF attains its maximum (“the most likely value”). This is known as 
the maximum a posteriori (MAP) estimate. 


Suppose that the conditional PDF for y”, given @. is 
FO; x^) = P(y* = x* 10) 
and that the prior PDF for @ is 
goz) = P0 =z) 


[Here P(A|B) = the conditional probability of the event A given the event B. We 
also allowed somewhat informal notation.] Using Bayes’s rule (1.10) and with some 
abuse of notation, we thus find the posterior PDF for 9, i.e., the conditional PDF for 
@, given the observations: 


P(y|0) - P@) 


Ny 


~ f(O: y”) + 96(6) (7.76) 


The posterior PDF as a function of $\theta$ is thus proportional to the likelihood
function multiplied by the prior PDF. Often the prior PDF has an insignificant
influence. Then the MAP estimate

$$
\hat\theta_{MAP}(y^N) = \arg\max_\theta \left\{ f_y(\theta; y^N) \cdot g_\theta(\theta) \right\} \tag{7.77}
$$
is close to the MLE (7.70).
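To make the relation between (7.77) and the MLE concrete, the sketch below
reuses the setting of independent Gaussian observations with known variances
and adds a Gaussian prior on $\theta$. The choice of a Gaussian prior, its mean
and variance, and the data values are assumptions made only for this
illustration. With that prior the maximizer of (7.77) is available in closed
form, and a very wide prior gives an estimate that is practically the MLE.

```python
import numpy as np

def map_mean_gaussian_prior(y, lam, prior_mean, prior_var):
    """MAP estimate of a common mean for Gaussian observations with known
    variances lam(i) and an (assumed) Gaussian prior N(prior_mean, prior_var).

    Setting the derivative of log f_y + log g_theta to zero gives
    sum((y_i - theta)/lam_i) - (theta - prior_mean)/prior_var = 0.
    """
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(lam, dtype=float)
    return (np.sum(w * y) + prior_mean / prior_var) / (np.sum(w) + 1.0 / prior_var)

# Hypothetical data (assumed values).
y = np.array([1.0, 2.0, 4.0])
lam = np.array([1.0, 1.0, 2.0])

theta_mle = np.sum(y / lam) / np.sum(1.0 / lam)
theta_map_tight = map_mean_gaussian_prior(y, lam, prior_mean=0.0, prior_var=1.0)
theta_map_vague = map_mean_gaussian_prior(y, lam, prior_mean=0.0, prior_var=1e8)
```

A tight prior centered at zero shrinks the estimate toward zero, while the
vague prior leaves it essentially at the maximum likelihood value.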


Cramér-Rao Inequality 


The quality of an estimator can be assessed by its mean-square error matrix: 


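$$
P = E\left[\hat\theta(y^N) - \theta_0\right]\left[\hat\theta(y^N) - \theta_0\right]^T \tag{7.78}
$$


Here $\theta_0$ denotes the "true value" of $\theta$, and (7.78) is evaluated under
the assumption that the PDF of $y^N$ is $f_y(\theta_0; y^N)$.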

We may be interested in selecting estimators that make P small. It is then
interesting to note that there is a lower limit to the values of P that can be obtained
with various unbiased estimators. This is the so-called Cramér-Rao inequality:
Let $\hat\theta(y^N)$ be an estimator of $\theta$ such that
$E\hat\theta(y^N) = \theta_0$, where E evaluates the mean, assuming that the PDF
of $y^N$ is $f_y(\theta_0; y^N)$ (to hold for all values of $\theta_0$), and
suppose that $y^N$ may take values in a subset of $R^N$, whose boundary does not
depend on $\theta$. Then


$$
E\left[\hat\theta(y^N) - \theta_0\right]\left[\hat\theta(y^N) - \theta_0\right]^T \ge M^{-1} \tag{7.79}
$$


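where

$$
M = E\left[\frac{d}{d\theta}\log f_y(\theta; y^N)\right]
\left[\frac{d}{d\theta}\log f_y(\theta; y^N)\right]^T \Bigg|_{\theta=\theta_0}
= -E\,\frac{d^2}{d\theta^2}\log f_y(\theta; y^N) \Bigg|_{\theta=\theta_0} \tag{7.80}
$$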


Since $\theta$ is a d-dimensional vector, $(d/d\theta)\log f_y(\theta; y^N)$ is a
d-dimensional column vector and the Hessian $(d^2/d\theta^2)\log f_y(\theta; y^N)$
is a d × d matrix. This matrix M is known as the Fisher information matrix. Notice
that the evaluation of M normally requires knowledge of $\theta_0$, so the exact
value of M may not be available to the user.
A proof of the Cramér-Rao inequality is given in Appendix 7A.
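For the scalar case of independent Gaussian observations with common mean and
known variances $\lambda_i$, the second form of (7.80) gives
$M = \sum_i 1/\lambda_i$, since the second derivative of the log likelihood
with respect to the mean is $-\sum_i 1/\lambda_i$. The sketch below compares
the resulting bound with the exact variances of two unbiased estimators: the
variance-weighted average and the plain sample mean. The variance values are
hypothetical.

```python
import numpy as np

# Hypothetical known variances of the independent observations (assumed).
lam = np.array([1.0, 1.0, 2.0])
N = lam.size

# Fisher information (7.80) for the common mean, and the Cramer-Rao bound.
M = np.sum(1.0 / lam)
crb = 1.0 / M

# Exact variance of the weighted average with weights w_i = 1/lambda_i:
# Var = sum(w_i^2 * lambda_i) / (sum w_i)^2.
w = 1.0 / lam
var_weighted = np.sum(w ** 2 * lam) / np.sum(w) ** 2

# Exact variance of the plain sample mean (equal weights 1/N).
var_sample_mean = np.sum(lam) / N ** 2
```

In this example the weighted average attains the bound exactly, while the
sample mean has a strictly larger variance whenever the $\lambda_i$ differ.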


Asymptotic Properties of the MLE 


It is often difficult to exactly calculate properties of an estimator, such as (7.78).
Therefore, limiting properties as the sample size (in this case the number N) tends
to infinity are calculated instead. Classical results of this kind for the MLE in
the case of independent observations were obtained by Wald (1949) and Cramér (1946):


Suppose that the random variables {y(i)} are independent and identically
distributed, so that

$$
f_y(\theta; x_1, \ldots, x_N) = \prod_{i=1}^{N} f_{y(1)}(\theta, x_i)
$$

Suppose also that the distribution of $y^N$ is given by $f_y(\theta_0; x^N)$ for
some value $\theta_0$. Then the random variable $\hat\theta_{ML}(y^N)$ tends to
$\theta_0$ with probability 1 as N tends to infinity, and the random variable

$$
\sqrt{N}\left[\hat\theta_{ML}(y^N) - \theta_0\right]
$$

converges in distribution to the normal distribution with zero mean and
covariance matrix given by the Cramér-Rao lower bound [$M^{-1}$ in (7.79)
and (7.80)].


In Chapters 8 and 9 we will establish that these results also hold when the ML
estimator is applied to dynamical systems. In this sense the MLE is thus the best
possible estimator. Let it, however, also be said that the MLE has sometimes been
criticized for its less favorable small-sample properties and that there are other
ways to assess the quality of an estimator than (7.78).


Probabilistic Models of Dynamical Systems


Suppose that the models in the model structure we have chosen in Section 7.1 include
both a predictor function and an assumed PDF for the associated prediction errors,
as described in Section 5.7:

$$
\begin{aligned}
\mathcal{M}(\theta):\quad & \hat y(t|\theta) = g(t, Z^{t-1}; \theta) \\
& \varepsilon(t,\theta) = y(t) - \hat y(t|\theta) \text{ are independent} \\
& \text{and have the PDF } f_e(x, t; \theta)
\end{aligned} \tag{7.81}
$$

Recall that we term a model like (7.81) that includes a PDF for $\varepsilon$ a
(complete) probabilistic model.


Likelihood Function for Probabilistic Models of Dynamical 
Systems 


We note that, according to the model (7.81), the output is generated by
$$
y(t) = g(t, Z^{t-1}; \theta) + \varepsilon(t,\theta) \tag{7.82}
$$

where $\varepsilon(t,\theta)$ has the PDF $f_e(x, t; \theta)$. The joint PDF for
the observations $y^N$ (given the deterministic sequence $u^N$) is then given by
Lemma 5.1. By replacing the dummy variables $x_i$ by the corresponding observations
$y(i)$, we obtain the likelihood function:


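$$
f_y(\theta; y^N) = \prod_{t=1}^{N} f_e\big(y(t) - g(t, Z^{t-1}; \theta), t; \theta\big)
= \prod_{t=1}^{N} f_e\big(\varepsilon(t,\theta), t; \theta\big) \tag{7.83}
$$
Maximizing this function is the same as maximizing
$$
\frac{1}{N}\log f_y(\theta; y^N) = \frac{1}{N}\sum_{t=1}^{N} \log f_e\big(\varepsilon(t,\theta), t; \theta\big) \tag{7.84}
$$
If we define
$$
\ell(\varepsilon, \theta, t) = -\log f_e(\varepsilon, t; \theta) \tag{7.85}
$$
we may write
$$
\hat\theta_{ML}(y^N) = \arg\min_\theta \frac{1}{N}\sum_{t=1}^{N} \ell\big(\varepsilon(t,\theta), \theta, t\big) \tag{7.86}
$$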


The maximum likelihood method can thus be seen as a special case of the
prediction-error criterion (7.12).
It is worth stressing that (7.85) and (7.86) give the exact maximum likelihood
method for the posed problem. It is sometimes pointed out that the exact likelihood
function is quite complicated for time-series problems and that one has to resort to
approximations of it (e.g., Kashyap and Rao, 1976; Akaike, 1973; Dzhaparidze and
Yaglom, 1983). This is true in certain cases. The reason is that it may be difficult
to put, say, an ARMA model in the predictor form (7.81) (it will typically require
time-varying Kalman predictors). The problem is therefore related to finding the
exact predictor and is not a problem with the ML method as such. When we employ
time-invariant predictors, we implicitly assume all previous observations to be known
[see (3.24)] and typically replace the corresponding initial values by zero or estimate
them. Then it is appropriate to interpret the likelihood function as conditional w.r.t.
these values and to call the method a conditional ML method (e.g., Kashyap and
Rao, 1976).


Gaussian Special Case 


When the prediction errors are assumed to be Gaussian with zero mean values and
(t-independent) covariances $\lambda$, we have

$$
\ell(\varepsilon, \theta, t) = -\log f_e(\varepsilon, t; \theta)
= \text{const} + \frac{1}{2}\log\lambda + \frac{1}{2}\frac{\varepsilon^2}{\lambda} \tag{7.87}
$$

If $\lambda$ is known, then (7.87) is equivalent to the quadratic criterion (7.15).
If $\lambda$ is unknown, (7.87) is an example of a parameterized norm criterion
(7.16). Depending on the underlying model structure, $\lambda$ may or may not be
parametrized independently of the predictor parameters. See Problem 7E.4 for an
illustration of this. Compare also Problem 7E.7.
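The following sketch illustrates (7.86) with the Gaussian norm (7.87) for a
first-order ARX predictor $\hat y(t|\theta) = -a\,y(t-1) + b\,u(t-1)$, under the
assumption that $\lambda$ is parameterized independently of $\theta = (a, b)$.
In that case the minimization over $\theta$ is an ordinary least-squares problem
and the minimizing $\lambda$ is the mean squared prediction error. The simulated
system, the noise level, and all function names are hypothetical.

```python
import numpy as np

def simulate_arx1(a0, b0, noise_std, n, seed=0):
    """Simulate y(t) + a0*y(t-1) = b0*u(t-1) + e(t) with white Gaussian u and e."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n)
    e = noise_std * rng.standard_normal(n)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = -a0 * y[t - 1] + b0 * u[t - 1] + e[t]
    return y, u

def prediction_errors(theta, y, u):
    """eps(t, theta) = y(t) - yhat(t|theta) for the first-order ARX predictor."""
    a, b = theta
    return y[1:] - (-a * y[:-1] + b * u[:-1])

def gaussian_ml_criterion(theta, lam, y, u):
    """Sample mean of the norm (7.87), with the constant term dropped."""
    eps = prediction_errors(theta, y, u)
    return np.mean(0.5 * np.log(lam) + 0.5 * eps ** 2 / lam)

def fit_gaussian_ml(y, u):
    """Minimize the criterion: least squares for theta, then lam = mean(eps^2)."""
    Phi = np.column_stack([-y[:-1], u[:-1]])
    theta, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
    lam = np.mean(prediction_errors(theta, y, u) ** 2)
    return theta, lam

y, u = simulate_arx1(a0=-0.5, b0=1.0, noise_std=0.1, n=500)
theta_hat, lam_hat = fit_gaussian_ml(y, u)
```

Because $\lambda$ only scales the quadratic term, the value of $\theta$ that
minimizes the criterion does not depend on $\lambda$ in this parameterization.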


Fisher Information Matrix and the Cramér-Rao Bound for 
Dynamical Systems 


Having established the log likelihood function in (7.84) for a model structure, we can
compute the information matrix (7.80). For simplicity, we then assume that the PDF
$f_e$ is known ($\theta$ independent) and time invariant. Let
$\ell_0(\varepsilon) = -\log f_e(\varepsilon)$. Hence

$$
\frac{d}{d\theta}\log f_y(\theta; y^N) = \sum_{t=1}^{N} \ell_0'\big(\varepsilon(t,\theta)\big)\,\psi(t,\theta)
$$
where, as in (4.121),
$$
\psi(t,\theta) = \frac{d}{d\theta}\hat y(t|\theta) = -\frac{d}{d\theta}\varepsilon(t,\theta)
\qquad [\text{a } d\text{-dimensional column vector}]
$$

Also, $\ell_0'$ is the derivative of $\ell_0(\varepsilon)$ w.r.t. $\varepsilon$. To
find the Fisher information matrix, we now evaluate the expectation of

$$
\left[\frac{d}{d\theta}\log f_y(\theta; y^N)\right]\left[\frac{d}{d\theta}\log f_y(\theta; y^N)\right]^T
$$
at $\theta_0$ under the assumption that the true PDF for $y^N$ indeed is
$f_y(\theta_0; y^N)$. The latter statement means that
$\varepsilon(t, \theta_0) = e_0(t)$ will be treated as a sequence of independent
random variables with PDF's $f_e(x)$. Call this expectation $M_N$. Thus


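$$
M_N = E\sum_{t=1}^{N}\sum_{s=1}^{N} \ell_0'(e_0(t))\,\ell_0'(e_0(s))\,\psi(t,\theta_0)\,\psi^T(s,\theta_0)
= \sum_{t=1}^{N} E\big[\ell_0'(e_0(t))\big]^2 \cdot E\,\psi(t,\theta_0)\,\psi^T(t,\theta_0)
$$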


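since $e_0(t)$ and $e_0(s)$ are independent for $s \ne t$. We also have
$\ell_0'(x) = -[\log f_e(x)]' = -f_e'(x)/f_e(x)$, and

$$
E\big[\ell_0'(e_0(t))\big]^2 = \int_{-\infty}^{\infty} \frac{[f_e'(x)]^2}{[f_e(x)]^2}\, f_e(x)\,dx
= \int_{-\infty}^{\infty} \frac{[f_e'(x)]^2}{f_e(x)}\,dx \triangleq \frac{1}{\kappa_0} \tag{7.88}
$$
If $e_0(t)$ is Gaussian with variance $\lambda_0$, it is easy to verify that
$\kappa_0 = \lambda_0$. Hence
$$
M_N = \frac{1}{\kappa_0}\sum_{t=1}^{N} E\,\psi(t,\theta_0)\,\psi^T(t,\theta_0) \tag{7.89}
$$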

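Now the Cramér-Rao inequality tells us that for any unbiased estimator
$\hat\theta_N$ of $\theta$ (i.e., estimators such that $E\hat\theta_N = \theta_0$
regardless of the true value $\theta_0$) we must have

$$
\mathrm{Cov}\,\hat\theta_N \ge M_N^{-1} \tag{7.90}
$$

Notice that this bound applies for any N and for all parameter estimation methods.
We thus have

$$
\mathrm{Cov}\,\hat\theta_N \ge \kappa_0\left[\sum_{t=1}^{N} E\,\psi(t,\theta_0)\,\psi^T(t,\theta_0)\right]^{-1},
\qquad \kappa_0 = \lambda_0 \text{ for Gaussian innovations} \tag{7.91}
$$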
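The claim that $\kappa_0 = \lambda_0$ for Gaussian innovations can be checked
numerically by evaluating the integral in (7.88) on a grid. The sketch below
does this with a plain Riemann sum; the grid limits, the grid density, and the
value of $\lambda_0$ are assumptions chosen for illustration.

```python
import numpy as np

def kappa0_numeric(pdf, dpdf, x):
    """Approximate kappa_0 in (7.88): 1 / integral of f'(x)^2 / f(x) dx,
    using a Riemann sum on the equally spaced grid x."""
    f = pdf(x)
    df = dpdf(x)
    dx = x[1] - x[0]
    return 1.0 / (np.sum(df ** 2 / f) * dx)

lam0 = 2.0  # hypothetical innovation variance

def gauss_pdf(x):
    """Zero-mean Gaussian PDF with variance lam0."""
    return np.exp(-x ** 2 / (2.0 * lam0)) / np.sqrt(2.0 * np.pi * lam0)

def gauss_dpdf(x):
    """Derivative of the Gaussian PDF with respect to x."""
    return -x / lam0 * gauss_pdf(x)

x = np.linspace(-15.0, 15.0, 30001)
kappa = kappa0_numeric(gauss_pdf, gauss_dpdf, x)
```

The same helper can be applied to other smooth densities to see how the bound
(7.91) changes when the innovations are not Gaussian.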


Multivariable Gaussian Case (*) 


When the prediction errors are p-dimensional and jointly Gaussian with zero mean
and covariance matrices $\Lambda$, we obtain from the multivariable Gaussian
distribution

$$
\ell(\varepsilon, t; \theta) = \text{const} + \frac{1}{2}\log\det\Lambda
+ \frac{1}{2}\varepsilon^T\Lambda^{-1}\varepsilon \tag{7.92}
$$
Then the negative logarithm of the likelihood function takes the form
$$
V_N(\theta, \Lambda, Z^N) = \text{const} + \frac{N}{2}\log\det\Lambda
+ \frac{1}{2}\sum_{t=1}^{N} \varepsilon^T(t,\theta)\,\Lambda^{-1}\varepsilon(t,\theta) \tag{7.93}
$$

If the p × p covariance matrix $\Lambda$ is fully unknown and not parametrized
through $\theta$, it is possible to minimize (7.93) analytically with respect to
$\Lambda$ for every fixed $\theta$:


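$$
\arg\min_\Lambda V_N(\theta, \Lambda, Z^N) = \hat\Lambda_N(\theta)
= \frac{1}{N}\sum_{t=1}^{N} \varepsilon(t,\theta)\,\varepsilon^T(t,\theta) \tag{7.94}
$$

Then

$$
\hat\theta_N = \arg\min_\theta V_N(\theta, \hat\Lambda_N(\theta), Z^N)
= \arg\min_\theta \left[\frac{N}{2}\log\det\hat\Lambda_N(\theta) + \frac{Np}{2}\right] \tag{7.95}
$$

(see Problem 7D.3) where $p = \dim\varepsilon$. Hence we may in this particular case
use the criterion

$$
\hat\theta_N = \arg\min_\theta \det\left[\frac{1}{N}\sum_{t=1}^{N} \varepsilon(t,\theta)\,\varepsilon^T(t,\theta)\right] \tag{7.96}
$$

With this we have actually been led to a criterion of the type (7.29) to (7.30) with
$h(\Lambda) = \det\Lambda$.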
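The analytic minimization in (7.94) and the concentrated value in (7.95) can
be checked numerically for a fixed set of prediction errors. In the sketch
below the errors are simply drawn at random, which is an assumption made for
illustration (in practice they would come from a model at a given $\theta$);
the function names are hypothetical.

```python
import numpy as np

def sample_covariance(E):
    """Lambda_hat in (7.94); each row of E is one prediction error eps(t)^T."""
    return E.T @ E / E.shape[0]

def neg_log_likelihood(E, Lam):
    """Criterion (7.93) without the constant term."""
    N = E.shape[0]
    _, logdet = np.linalg.slogdet(Lam)
    # Sum over t of eps(t)^T Lam^{-1} eps(t).
    quad = np.sum((E @ np.linalg.inv(Lam)) * E)
    return 0.5 * N * logdet + 0.5 * quad

rng = np.random.default_rng(1)
# Hypothetical two-dimensional prediction errors with correlated components.
E = rng.standard_normal((200, 2)) @ np.array([[1.0, 0.3], [0.0, 0.5]])
N, p = E.shape

Lam_hat = sample_covariance(E)
V_min = neg_log_likelihood(E, Lam_hat)
# Value predicted by (7.95): (N/2) log det Lambda_hat + N p / 2.
V_concentrated = 0.5 * N * np.log(np.linalg.det(Lam_hat)) + 0.5 * N * p
```

Since the second term in (7.95) does not depend on $\theta$, only the
determinant of the sample covariance matters for the estimate, which is what
(7.96) expresses.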


Information and Entropy Measures (*)


In (5.69) and (5.70) we gave a general formulation of a model as an assumed PDF
for the observations $Z^t$:

$$
f_m(t, Z^t) \tag{7.97}
$$

Let $f_0(t, Z^t)$ denote the true PDF for the observations. The agreement between
two PDF's can be measured in terms of the Kullback-Leibler information distance
(Kullback and Leibler, 1951):

$$
I(f_0; f_m) = \int f_0(t, x^t)\,\log\frac{f_0(t, x^t)}{f_m(t, x^t)}\,dx^t \tag{7.98}
$$

Here we use $x^t$ as an integration variable for $Z^t$. This distance is also the
negative entropy of $f_0$ with respect to $f_m$:

$$
S(f_0; f_m) = -I(f_0; f_m) \tag{7.99}
$$
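As a concrete instance of (7.98), the sketch below evaluates the information
distance between two scalar Gaussian PDFs, once by a Riemann sum over a grid
and once with the well-known closed-form expression for Gaussians. The choice
of scalar Gaussian densities, their parameters, and the grid are assumptions
made only for illustration.

```python
import numpy as np

def gauss_logpdf(x, mean, var):
    """Logarithm of the scalar Gaussian PDF with the given mean and variance."""
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var)

def kl_numeric(mean0, var0, mean_m, var_m, x):
    """Riemann-sum approximation of (7.98) on the equally spaced grid x."""
    log_f0 = gauss_logpdf(x, mean0, var0)
    log_fm = gauss_logpdf(x, mean_m, var_m)
    dx = x[1] - x[0]
    return np.sum(np.exp(log_f0) * (log_f0 - log_fm)) * dx

def kl_closed_form(mean0, var0, mean_m, var_m):
    """Closed-form value of (7.98) for two scalar Gaussians."""
    return 0.5 * (np.log(var_m / var0) + (var0 + (mean0 - mean_m) ** 2) / var_m - 1.0)

x = np.linspace(-20.0, 20.0, 40001)
# "True" PDF N(0, 1) and model PDF N(1, 4): hypothetical values.
I_num = kl_numeric(0.0, 1.0, 1.0, 4.0, x)
I_exact = kl_closed_form(0.0, 1.0, 1.0, 4.0)
```

The distance is zero when the two PDFs coincide and positive otherwise, and it
is not symmetric in its two arguments.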


An attractive formulation of the identification problem is to look for a model
that maximizes the entropy with respect to the true system or, alternatively, minimizes
the information distance to the true system. This formulation has been pursued by
Akaike in a number of interesting contributions, Akaike (1972, 1974a, 1981).
With a parametrized set of models
$f_{\mathcal{M}(\theta)}(t, Z^t) = f(\theta; t, Z^t)$, we would thus solve

$$
\hat\theta_N = \arg\min_\theta I\big(f_0(N, Z^N); f(\theta; N, Z^N)\big) \tag{7.100}
$$
The information measure can be written
$$
\begin{aligned}
I(f_0; f) &= -\int \log\big[f(\theta; N, x^N)\big]\, f_0(N, x^N)\,dx^N
+ \int \log\big[f_0(N, x^N)\big]\, f_0(N, x^N)\,dx^N \\
&= -E_0 \log f(\theta; N, Z^N) + \theta\text{-independent terms}
\end{aligned}
$$

where $E_0$ denotes expectation with respect to the true system.
The problem (7.100) is thus the same as

$$
\hat\theta_N = \arg\min_\theta \big[-E_0 \log f(\theta; N, Z^N)\big] \tag{7.101}
$$

The problem here is of course that the expectation is not computable since the true
PDF is unknown. A simple estimate of the expectation is to replace it by the
observation

$$
E_0 \log f(\theta; N, Z^N) \approx \log f(\theta; N, Z^N) \tag{7.102}
$$
This gives the log likelihood function for the problem and (7.101) then equals the
MLE. The ML approach to identification can consequently also be interpreted as a
maximum entropy strategy or a minimum information distance method.


The distance between the resulting model and the true system thus is
$$
I\big(f_0(N, Z^N); f(\hat\theta_N; N, Z^N)\big) \tag{7.103}
$$

This is a random variable, since $\hat\theta_N$ depends on $Z^N$. As an ultimate
criterion of fit, Akaike (1981) suggested the use of the average information
distance, or average entropy

$$
E_{\hat\theta_N}\, I\big(f_0(N, Z^N); f(\hat\theta_N; N, Z^N)\big) \tag{7.104}
$$

This is to be minimized with respect to both the model set and $\hat\theta_N$. As an
unbiased estimate of the quantity (7.104), he suggested

$$
\log f(\hat\theta_N; N, Z^N) - \dim\theta \tag{7.105}
$$

Calculations supporting this estimate will be given in Section 16.4.
The expression (7.105) used in (7.101) gives, with (7.84) and (7.85),
$$
\hat\theta_{AIC}(Z^N) = \arg\min_\theta \left\{ \frac{1}{N}\sum_{t=1}^{N} \ell\big(\varepsilon(t,\theta), t, \theta\big)
+ \frac{\dim\theta}{N} \right\} \tag{7.106}
$$

This is Akaike's information theoretic criterion (AIC). When applied to a given
model structure, this estimate does not differ from the MLE in the same structure.
The advantage with (7.106) is, however, that the minimization can be performed
with respect to different model structures, thus allowing for a general identification
theory. See Section 16.4 for a further discussion of this aspect.
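A minimal sketch of how (7.106) can be used to compare model structures is
given below. It assumes Gaussian prediction errors with a known variance
$\lambda$, so that $\ell$ is the quadratic part of (7.87), and it compares
finite impulse response predictors of different orders d. The simulated data,
the candidate orders, and the function names are hypothetical.

```python
import numpy as np

def aic_value(y, u, d, lam, d_max):
    """Value of the bracket in (7.106) for an FIR predictor of order d,
    yhat(t) = b_1 u(t-1) + ... + b_d u(t-d), with ell = eps^2 / (2 lam).
    All orders are evaluated on the same samples t = d_max, ..., N-1."""
    N = u.size
    Phi = np.column_stack([u[d_max - k: N - k] for k in range(1, d + 1)])
    target = y[d_max:]
    theta, *_ = np.linalg.lstsq(Phi, target, rcond=None)  # MLE within the structure
    eps = target - Phi @ theta
    return np.mean(eps ** 2 / (2.0 * lam)) + d / target.size

# Hypothetical data from a second-order FIR system with noise variance 0.01.
rng = np.random.default_rng(2)
N = 300
u = rng.standard_normal(N)
y = np.zeros(N)
y[2:] = 1.0 * u[1:-1] + 0.5 * u[:-2]
y += 0.1 * rng.standard_normal(N)

orders = list(range(1, 6))
values = [aic_value(y, u, d, lam=0.01, d_max=5) for d in orders]
best_order = orders[int(np.argmin(values))]
```

Within each order the estimate is the MLE; the penalty term only matters when
the orders are compared. Too low an order is penalized through a large misfit,
while the term $\dim\theta / N$ discourages unnecessarily high orders.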

An approach that is conceptually related to information measures is Rissanen's
minimum description length (MDL) principle. This states that a model should be
sought that allows the shortest possible code or description of the observed data.
See Rissanen (1978, 1986). Within a given model structure, it gives estimates that
coincide with the MLE. See also Section 16.4.


Regularization 
Sometimes there is reason to consider the following modified version of the criterion
(7.11):
$$
W_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N} \ell\big(\varepsilon_F(t,\theta)\big) + \delta\,|\theta - \theta^\#|^2
= V_N(\theta, Z^N) + \delta\,|\theta - \theta^\#|^2 \tag{7.107}
$$

It differs from the basic criterion only by adding a cost on the squared distance
between $\theta$ and $\theta^\#$. The latter is a fixed point in the parameter space,
and is often taken as the origin, $\theta^\# = 0$. The reasons and interpretations for
including such a term could be listed as follows:


• If $\theta$ contains many parameters, the problem of minimizing $V_N$ may be
ill-conditioned, in the sense that the Hessian $V_N''$ may be an ill-conditioned
matrix. Adding the norm penalty will add a positive multiple of the identity
matrix ($2\delta I$) to this matrix, to make it better conditioned. This is the
reason why the technique is called regularization.


• If the model parameterization contains many parameters (like in the nonlinear
black-box models of Section 5.4), it may not be possible to estimate several of
them accurately. There are then advantages in pulling them towards a fixed
point $\theta^\#$. The ones that have the smallest influence on $V_N$ will be
affected most by this pulling force. The advantages of this will be brought out
more clearly in Section 16.4. We may think of $\delta$ as a knob by which we control
the effective number of parameters that is used in the minimization. A large value
of $\delta$ will lock more parameters to the vicinity of $\theta^\#$.

• Comparing with the MAP estimate (7.77) we see that this corresponds to
minimizing $W_N(\theta, Z^N) = -(1/N)\log\left[f_y(\theta; Z^N) \cdot g_\theta(\theta)\right]$
(up to a $\theta$-independent constant) if we take

$$
V_N(\theta, Z^N) = -\frac{1}{N}\log f_y(\theta; Z^N) \tag{7.108a}
$$

$$
g_\theta(\theta) = (N\delta/\pi)^{d/2}\, e^{-N\delta\,|\theta - \theta^\#|^2}, \qquad d = \dim\theta \tag{7.108b}
$$




that is, we assign a prior probability to the parameters that they are Gaussian
distributed with mean $\theta^\#$ and covariance matrix $\frac{1}{2N\delta} I$. This
prior is clearly well in line with the second interpretation.
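For a linear regression with the quadratic norm $\ell(\varepsilon) = \varepsilon^2$,
the minimizer of (7.107) is available in closed form, and the effect of the
penalty on the conditioning of the Hessian can be inspected directly. The
sketch below does this for two nearly collinear regressors. The data, the value
of $\delta$, and the function name are assumptions made for illustration.

```python
import numpy as np

def regularized_ls(Phi, y, delta, theta_ref):
    """Minimizer of (1/N)|y - Phi theta|^2 + delta |theta - theta_ref|^2.
    Setting the gradient to zero gives
    (Phi^T Phi / N + delta I) theta = Phi^T y / N + delta theta_ref."""
    N, d = Phi.shape
    R = Phi.T @ Phi / N
    f = Phi.T @ y / N
    return np.linalg.solve(R + delta * np.eye(d), f + delta * theta_ref)

# Hypothetical data with two nearly collinear regressors.
rng = np.random.default_rng(3)
N = 100
x1 = rng.standard_normal(N)
x2 = x1 + 1e-3 * rng.standard_normal(N)
Phi = np.column_stack([x1, x2])
y = Phi @ np.array([1.0, 1.0]) + 0.1 * rng.standard_normal(N)

delta = 0.1
theta_ref = np.zeros(2)
theta_reg = regularized_ls(Phi, y, delta, theta_ref)

# Hessians of the unregularized and the regularized criteria.
hess_plain = 2.0 * Phi.T @ Phi / N
hess_reg = hess_plain + 2.0 * delta * np.eye(2)
cond_plain = np.linalg.cond(hess_plain)
cond_reg = np.linalg.cond(hess_reg)
```

The poorly determined direction in parameter space (the difference between the
two coefficients) is the one that is pulled most strongly towards the
reference point, in line with the second interpretation above.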


A Pragmatic Viewpoint 


It is good and reassuring to know that general and sound basic principles, such as
maximum likelihood, maximum entropy, and minimum information distance, lead
to criteria of the kind (7.11). However, in the end we are faced with a sequence of
figures that are to be compared with "guesses" produced by the model. It could then
always be questioned whether a probabilistic framework and abstract principles are
applicable, since we observe only a given sequence of data, and the framework relates
to the thought experiment that the data collection can be repeated infinitely many
times under "similar" conditions. It is thus an important feature that minimizing
(7.11) makes sense, even without a probabilistic framework and without "alibis"
provided by abstract principles.


7.5 CORRELATING PREDICTION ERRORS WITH PAST DATA 


Ideally, the prediction error $\varepsilon(t,\theta)$ for a "good" model should be
independent of past data $Z^{t-1}$. For one thing, this condition is inherent in a
probabilistic model, such as (7.81). Another and more pragmatic way of seeing this
condition is that if $\varepsilon(t,\theta)$ is correlated with $Z^{t-1}$ then there
was more information available in $Z^{t-1}$ about $y(t)$ than picked up by
$\hat y(t|\theta)$. The predictor is then not ideal. This leads to
the characterization of a good model as one that produces prediction errors that are
independent of past data.

A test if $\varepsilon(t,\theta)$ is independent of the whole (and increasing) data
set $Z^{t-1}$ would amount to testing whether all nonlinear transformations of
$\varepsilon(t,\theta)$ are uncorrelated with all possible functions of $Z^{t-1}$.
This is of course not feasible in practice.

Instead, we may select a certain finite-dimensional vector sequence $\{\zeta(t)\}$
derived from $Z^{t-1}$ and demand a certain transformation of
$\{\varepsilon(t,\theta)\}$ to be uncorrelated with this sequence. This would give

$$
\frac{1}{N}\sum_{t=1}^{N} \zeta(t)\,\alpha\big(\varepsilon(t,\theta)\big) = 0 \tag{7.109}
$$

and the $\theta$-value that satisfies this equation would be the best estimate
$\hat\theta_N$ based on the observed data. Here $\alpha(\varepsilon)$ is the chosen
transformation of $\varepsilon$, and the typical choice would be
$\alpha(\varepsilon) = \varepsilon$.

We may carry this idea into a somewhat higher degree of generality. In the
first place, we could replace the prediction error with filtered versions as in (7.10).
Second, we obviously have considerable freedom in choosing the sequence $\zeta(t)$.
It is quite possible that what appears to be the best choice of $\zeta(t)$ may depend
on properties of the system. In such a case we would let $\zeta(t)$ depend on
$\theta$, and we have the following method:
Choose a linear filter $L(q)$ and let
$$
\varepsilon_F(t,\theta) = L(q)\,\varepsilon(t,\theta) \tag{7.110a}
$$
Choose a sequence of correlation vectors
$$
\zeta(t,\theta) = \zeta(t, Z^{t-1}, \theta) \tag{7.110b}
$$

constructed from past data and, possibly, from $\theta$. Choose a function
$\alpha(\varepsilon)$. Then calculate

$$
\hat\theta_N = \mathrm{sol}\left[f_N(\theta, Z^N) = 0\right] \tag{7.110c}
$$

$$
f_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N} \zeta(t,\theta)\,\alpha\big(\varepsilon_F(t,\theta)\big) \tag{7.110d}
$$

Here we used the notation
sol[f(x) = 0] = the solution(s) to the equation f(x) = 0


Normally, the dimension of $\zeta$ would be chosen so that $f_N$ is a d-dimensional
vector (which means that $\zeta$ is d × p if the output is a p-vector). Then (7.110)
has as many equations as unknowns. In some cases it may be useful to consider an
augmented correlation sequence $\zeta$ of higher dimension than d so that (7.110) is
an overdetermined set of equations, typically without any solution. Then the estimate
is taken to be the value that minimizes some quadratic norm of $f_N$:

$$
\hat\theta_N = \arg\min_{\theta \in D_{\mathcal{M}}} \left|f_N(\theta, Z^N)\right| \tag{7.111}
$$

There are obviously formal links between these correlation approaches and the
minimization approach of Section 7.2 (see, e.g., Problem 7D.6).

The procedure (7.110) is a conceptual method that takes different shapes,
depending on which model structures it is applied to and on the particular choices of
$\zeta$. In the subsequent section we shall discuss the perhaps best known
representatives of the family (7.110), the instrumental-variable methods. First,
however, we shall discuss the pseudolinear regression models.


Pseudolinear Regressions 
We found in Chapter 4 that a number of common prediction models could be written
as

$$
\hat y(t|\theta) = \varphi^T(t,\theta)\,\theta \tag{7.112}
$$

[see (4.21) and (4.45)]. If the data vector $\varphi(t,\theta)$ does not depend on
$\theta$, this relationship would be a linear regression. From this the term
pseudolinear regression for (7.112) is derived (Solo, 1978). For the model (7.112),
the "pseudo-regression vector" $\varphi(t,\theta)$ contains relevant past data, partly
reconstructed using the current model. It is thus reasonable to require from the
model that the resulting prediction errors be uncorrelated with $\varphi(t,\theta)$.
That is, we choose $\zeta(t,\theta) = \varphi(t,\theta)$ and
$\alpha(\varepsilon) = \varepsilon$ in (7.110) and arrive at the estimate

$$
\hat\theta_N^{PLR} = \mathrm{sol}\left\{\frac{1}{N}\sum_{t=1}^{N} \varphi(t,\theta)\left[y(t) - \varphi^T(t,\theta)\,\theta\right] = 0\right\} \tag{7.113}
$$

which we term the PLR estimate.

Models subject to (7.112) also lend themselves to a number of variants of
(7.113), basically corresponding to replacing $\varphi(t,\theta)$ with vectors in
which the "reconstructed" ($\theta$-dependent) elements are determined in some other
fashion. See Section 10.4.


7.6 INSTRUMENTAL-VARIABLE METHODS 


Instrumental Variables 
Consider again the linear regression model (7.31):

$$
\hat y(t|\theta) = \varphi^T(t)\,\theta \tag{7.114}
$$

Recall that this model contains several typical models of linear and nonlinear systems.
The least-squares estimate of $\theta$ is given by (7.34) and can also be expressed as

$$
\hat\theta_N^{LS} = \mathrm{sol}\left\{\frac{1}{N}\sum_{t=1}^{N} \varphi(t)\left[y(t) - \varphi^T(t)\,\theta\right] = 0\right\} \tag{7.115}
$$
An alternative interpretation of the LSE is consequently that it corresponds to (7.110)
with $L(q) = 1$ and $\zeta(t,\theta) = \varphi(t)$.
Now suppose that the data actually can be described as in (7.37):

$$
y(t) = \varphi^T(t)\,\theta_0 + v_0(t) \tag{7.116}
$$

We then found in Section 7.3 that the LSE $\hat\theta_N$ will not tend to $\theta_0$
in typical cases, the reason being correlation between $v_0(t)$ and $\varphi(t)$. Let
us therefore try a general correlation vector $\zeta(t)$ in (7.115). Following general
terminology in the system identification field, we call such an application of (7.110)
to a linear regression an instrumental-variable method (IV). The elements of $\zeta$
are then called instruments or instrumental variables. This gives

$$
\hat\theta_N^{IV} = \mathrm{sol}\left\{\frac{1}{N}\sum_{t=1}^{N} \zeta(t)\left[y(t) - \varphi^T(t)\,\theta\right] = 0\right\} \tag{7.117}
$$

or

$$
\hat\theta_N^{IV} = \left[\frac{1}{N}\sum_{t=1}^{N} \zeta(t)\,\varphi^T(t)\right]^{-1} \frac{1}{N}\sum_{t=1}^{N} \zeta(t)\,y(t) \tag{7.118}
$$
provided the indicated inverse exists. For $\hat\theta_N^{IV}$ to tend to $\theta_0$
for large N, we see from (7.117) that then
$(1/N)\sum_{t=1}^{N} \zeta(t)\,v_0(t)$ should tend to zero. For the method
(7.117) to be successfully applicable to the system (7.116), we would thus require the
following properties of the instrumental variable $\zeta(t)$ (replacing sample means
by expectation):

$$
E\,\zeta(t)\,\varphi^T(t) \text{ be nonsingular} \tag{7.119}
$$
$$
E\,\zeta(t)\,v_0(t) = 0 \tag{7.120}
$$

In words, we could say that the instruments must be correlated with the regression
variables but uncorrelated with the noise. Let us now discuss possible choices of
instruments that could be subject to (7.119) and (7.120).


Choices of Instruments 
Suppose that (7.114) is an ARX model
$$
y(t) + a_1 y(t-1) + \cdots + a_{n_a} y(t - n_a)
= b_1 u(t-1) + \cdots + b_{n_b} u(t - n_b) + v(t) \tag{7.121}
$$

Suppose also that the true description (7.116) corresponds to (7.121) with the
coefficients indexed by "zero." A natural idea is to generate the instruments similarly
to (7.121) so as to secure (7.119), but at the same time not let them be influenced by
$\{v_0(t)\}$. This leads to

$$
\zeta(t) = K(q)\left[-x(t-1)\ \ -x(t-2)\ \ldots\ -x(t-n_a)\ \ u(t-1)\ \ldots\ u(t-n_b)\right]^T \tag{7.122}
$$
where K is a linear filter and x(t) is generated from the input through a linear system
$$
N(q)\,x(t) = M(q)\,u(t) \tag{7.123}
$$

Here
$$
\begin{aligned}
N(q) &= 1 + n_1 q^{-1} + \cdots + n_{n_n} q^{-n_n} \\
M(q) &= m_0 + m_1 q^{-1} + \cdots + m_{n_m} q^{-n_m}
\end{aligned} \tag{7.124}
$$

Most instruments used in practice are generated in this way. Obviously, $\zeta(t)$ is
obtained from past inputs by linear filtering and can be written, conceptually, as

$$
\zeta(t) = \zeta(t, u^{t-1}) \tag{7.125}
$$

If the input is generated in open loop so that it does not depend on the noise
$v_0(t)$ in the system, then clearly (7.120) holds. Since both the $\varphi$-vector and
the $\zeta$-vector are generated from the same input sequence ($\varphi$ contains in
addition effects from $v_0$), it might be expected that (7.119) should hold "in
general." We shall return to this question in Section 8.6.

A simple and appealing choice of instruments is to first apply the LS method
to (7.121) and then use the LS-estimated model for N and M in (7.123). The
instruments are then chosen as in (7.122) with $K(q) = 1$. Systems operating in closed
loop and systems without inputs call for other ideas. See Problem 7G.3 for some
suggestions.
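The two-step procedure just described can be sketched as follows for a
first-order ARX structure. The data are simulated with a moving-average
disturbance $v_0(t) = e(t) + c_0 e(t-1)$, so that the least-squares estimate is
biased, and the instruments are built from the noise-free output of the
LS-estimated model as in (7.122) and (7.123) with $K(q) = 1$. The system
coefficients, the sample size, and the function names are hypothetical.

```python
import numpy as np

def simulate(a0, b0, c0, n, seed=0):
    """y(t) + a0*y(t-1) = b0*u(t-1) + v0(t), with v0(t) = e(t) + c0*e(t-1).
    The input u is white and generated in open loop."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n)
    e = rng.standard_normal(n)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = -a0 * y[t - 1] + b0 * u[t - 1] + e[t] + c0 * e[t - 1]
    return y, u

def ls_estimate(y, u):
    """Least-squares estimate (7.115) with phi(t) = [-y(t-1), u(t-1)]^T."""
    Phi = np.column_stack([-y[:-1], u[:-1]])
    theta, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
    return theta

def iv_estimate(y, u, n1, m1):
    """IV estimate (7.118) with zeta(t) = [-x(t-1), u(t-1)]^T, where
    x(t) + n1*x(t-1) = m1*u(t-1) is simulated from the input only."""
    x = np.zeros(y.size)
    for t in range(1, y.size):
        x[t] = -n1 * x[t - 1] + m1 * u[t - 1]
    Phi = np.column_stack([-y[:-1], u[:-1]])
    Zeta = np.column_stack([-x[:-1], u[:-1]])
    return np.linalg.solve(Zeta.T @ Phi, Zeta.T @ y[1:])

y, u = simulate(a0=-0.7, b0=1.0, c0=0.9, n=50000)
theta_ls = ls_estimate(y, u)                              # step 1: biased LS model
theta_iv = iv_estimate(y, u, theta_ls[0], theta_ls[1])    # step 2: IV estimate
```

Even though the LS model used to generate the instruments is biased, the
instruments depend only on the input and therefore satisfy (7.120) in this
open-loop setting.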


As outlined in Problem 7D.5, the use of the instrumental vector (7.122) to
(7.124) is equivalent to the vector

$$
\zeta^*(t) = \frac{K(q)}{N(q)}\left[u(t-1)\ \ u(t-2)\ \ldots\ u(t - n_a - n_b)\right]^T \tag{7.126}
$$

The IV estimate $\hat\theta_N^{IV}$ in (7.118) is thus the same for $\zeta^*$ as for
$\zeta$ in (7.122) and does not, for example, depend on the filter M in (7.124).


Model-dependent Instruments (*) 


The quality of the estimate $\hat\theta_N^{IV}$ will depend on the choice of
$\zeta(t)$. In Section 9.5 we shall derive general expressions for the asymptotic
covariance of $\hat\theta_N^{IV}$ and examine them further in Section 15.3. It then
turns out that it may be desirable to choose the filters in (7.123) equal to those of
the true system: $N(q) = A_0(q)$; $M(q) = B_0(q)$. These are clearly not known, but we
may let the instruments depend on the parameters in the obvious way:

$$
\begin{aligned}
\zeta(t,\theta) &= K(q)\left[-x(t-1,\theta)\ \ldots\ -x(t-n_a,\theta)\ \ u(t-1)\ \ldots\ u(t-n_b)\right]^T \\
A(q)\,x(t,\theta) &= B(q)\,u(t)
\end{aligned} \tag{7.127}
$$
In general, we could write the generation of $\zeta(t,\theta)$:

$$
\zeta(t,\theta) = K_u(q,\theta)\,u(t) \tag{7.128}
$$

where $K_u(q,\theta)$ is a d-dimensional column vector of linear filters.


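Including a prefilter (7.110a) and a "shaping" function $\alpha(\cdot)$ for the
prediction errors, the IV method could be summarized as follows:

$$
\varepsilon_F(t,\theta) = L(q)\left[y(t) - \varphi^T(t)\,\theta\right] \tag{7.129a}
$$
$$
\hat\theta_N^{IV} = \mathrm{sol}\left[f_N(\theta, Z^N) = 0\right] \tag{7.129b}
$$
where
$$
f_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N} \zeta(t,\theta)\,\alpha\big(\varepsilon_F(t,\theta)\big) \tag{7.129c}
$$
$$
\zeta(t,\theta) = K_u(q,\theta)\,u(t) \tag{7.129d}
$$


Extended IV Methods (*)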


So far in this section the dimension of $\zeta$ has been equal to $\dim\theta$. We
may also work with augmented instrumental variable vectors with dimension
$\dim\zeta > d$. The resulting method, corresponding to (7.110) and (7.111), will be
called an extended IV method and takes the form

$$
\hat\theta_N^{EIV} = \arg\min_\theta \left|\frac{1}{N}\sum_{t=1}^{N} \zeta(t,\theta)\,\alpha\big(\varepsilon_F(t,\theta)\big)\right|_Q^2 \tag{7.130}
$$
The subscript Q denotes Q-norm:

$$
|x|_Q^2 = x^T Q x \tag{7.131}
$$

In case $\zeta$ does not depend on $\theta$ and $\alpha(\varepsilon) = \varepsilon$,
(7.130) can be solved explicitly. See Problem 7D.7.
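One way to write the explicit solution, assuming in addition that $L(q) = 1$,
is as a weighted least-squares problem: with $R = (1/N)\sum \zeta(t)\varphi^T(t)$
and $r = (1/N)\sum \zeta(t) y(t)$, the criterion in (7.130) is $|r - R\theta|_Q^2$,
which is minimized by $\theta = (R^T Q R)^{-1} R^T Q r$. The sketch below is a
hedged illustration of this formula and not necessarily the form intended in
Problem 7D.7; the data, the choice $Q = I$, and the function name are
assumptions.

```python
import numpy as np

def extended_iv(Zeta, Phi, y, Q):
    """Explicit minimizer of |r - R theta|_Q^2, where
    R = (1/N) Zeta^T Phi and r = (1/N) Zeta^T y.
    Rows of Zeta are zeta(t)^T and rows of Phi are phi(t)^T."""
    N = y.size
    R = Zeta.T @ Phi / N
    r = Zeta.T @ y / N
    return np.linalg.solve(R.T @ Q @ R, R.T @ Q @ r)

# Hypothetical example: two regressors and three instruments built from the input.
rng = np.random.default_rng(4)
N = 5000
u = rng.standard_normal(N + 3)
Phi = np.column_stack([u[2:-1], u[1:-2]])             # [u(t-1), u(t-2)]
Zeta = np.column_stack([u[2:-1], u[1:-2], u[0:-3]])   # adds u(t-3) as an instrument
theta0 = np.array([1.0, -0.5])
y = Phi @ theta0 + 0.5 * rng.standard_normal(N)

Q = np.eye(3)
theta_eiv = extended_iv(Zeta, Phi, y, Q)
```

When the number of instruments equals $d$ and $R$ is invertible, the formula
reduces to $R^{-1} r$, which is the basic IV estimate (7.118).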


Frequency-domain Interpretation (*)


Quite analogously to (7.20) to (7.25) in the prediction error case, the criterion (7.129)
can be expressed in the frequency domain using Parseval's relationship. We then
assume that $\alpha(\varepsilon) = \varepsilon$, and that a linear generation of the
instruments as in (7.128) is used. This gives
$$
f_N(\theta, Z^N) \approx \frac{1}{2\pi}\int_{-\pi}^{\pi}
\left[\hat G_N(e^{i\omega}) - G(e^{i\omega},\theta)\right] |U_N(\omega)|^2
\, A(e^{i\omega},\theta)\, L(e^{i\omega})\, K_u(e^{-i\omega},\theta)\, d\omega \tag{7.132}
$$
Here $A(q,\theta)$ is the A-polynomial that corresponds to $\theta$ in the model
(7.121).


Multivariable Case (*) 


Suppose now that the output is p-dimensional and the input m-dimensional. Then
the instrument $\zeta(t)$ is a d × p matrix. A linear generation of
$\zeta(t,\theta)$ could still be written as (7.128), with the interpretation that the
jth column of $\zeta(t,\theta)$ is given by

$$
\zeta^{(j)}(t,\theta) = K_u^{(j)}(q,\theta)\,u(t) \tag{7.133}
$$

where $K_u^{(j)}(q,\theta)$ is a d × m matrix filter. [$K_u(q,\theta)$ in (7.128) is
thus a tensor, a "three-index entity".] With $\alpha(\varepsilon)$ being a function
from $R^p$ to $R^p$ and $L(q)$ a p × p matrix filter, the IV method is still given by
(7.129).


7.7 USING FREQUENCY DOMAIN DATA TO FIT LINEAR MODELS (*)


In actual practice, most data are of course collected as samples of the input and
output time signals. There are occasions when it is natural and fruitful to consider
the Fourier transforms of the inputs and the outputs to be the primary data. It could
be, for example, that data are collected by a frequency analyzer, which provides the
transforms to the user, rather than the original time domain data. It could also be
that one subjects the measured data to Fourier transformation, before fitting them
to models. In some applications, like microwave fields, impedances, etc., the raw
measurements are naturally made directly in the frequency domain. This view has
been less common in the traditional system identification literature, but has been
of great importance in the Mechanical Engineering community, vibrational analysis,
and so on. The possible advantages of this will be listed later in this section. The
usefulness of such an approach has been made very clear in the work of Schoukens
and Pintelon; see in particular the book Schoukens and Pintelon (1991) and the survey
Pintelon et al. (1994).

There is clearly a very close relationship between time domain methods and 
frequency domain methods for linear models. We saw in (7.25) that the prediction 
error method for time domain data can (approximately) be interpreted as a fit in the 
frequency domain. We shall in this section look at some aspects of working directly 
with data in the frequency domain. 


Continuous Time Models 


An important advantage of frequency domain data is that it is equally simple to build
time continuous models as discrete time/sampled data ones. This means that we can
work with models of the kind

$$
y(t) = G(p,\theta)\,u(t) + H(p,\theta)\,e(t) \tag{7.134}
$$

(where p denotes the differentiation operator) analogously to our basic discrete
time model (7.3). See also (4.49) and the ensuing discussion. Note the considerable
freedom in parameterizing (7.134): from black-box models in terms of numerator
and denominator polynomials, or gain, time-delay, and time constant (see (4.50)),
to physically parameterized ones like (4.64). In addition to these traditional
time-domain parameterizations, one may also parameterize the transfer functions in a
way that is more frequency domain oriented. A simple case (see Problem 7G.2) is to let

$$
\begin{aligned}
G(i\omega,\theta) &= \sum_{k=1}^{d} \left(g_k^R + i\,g_k^I\right) W_\gamma(k, \omega - \omega_k) \\
\theta &= \left[g_1^R,\ g_1^I,\ \ldots,\ g_d^R,\ g_d^I\right]
\end{aligned} \tag{7.135}
$$

One should typically think of the functions $W_\gamma(k, \omega)$ as bandpass filters,
with a width that may be scaled by $\gamma$. The parameter
$g_k = g_k^R + i\,g_k^I$ would then describe the frequency response around the
frequency value $\omega_k$. If the width of the passband increases with
frequency we obtain parameterizations linked to wavelet transforms. See, e.g., the
insightful discussion by Ninness (1993).


Estimation from Frequency Domain Data 


Suppose now that the original data are supposed to be

$$
Z^N = \{Y(\omega_k),\ U(\omega_k),\ k = 1, \ldots, N\} \tag{7.136}
$$
where $Y(\omega_k)$ and $U(\omega_k)$ either are the discrete Fourier transforms (2.37)
of y(t) and u(t) or are considered as approximations of the Fourier transforms of the
underlying continuous signals:

$$
Y(\omega) \approx \int_0^{\infty} y(t)\,e^{-i\omega t}\,dt \tag{7.137}
$$

Which interpretation is more suitable depends of course on the signal character,
sampling interval, and so on.

How to estimate $\theta$ in (7.134) or its discrete time counterpart from (7.136)? In
view of (7.25) it would be tempting to use

$$
\begin{aligned}
\hat\theta_N &= \arg\min_\theta V(\theta) \\
V(\theta) &= \sum_{k=1}^{N} \left|Y(\omega_k) - G(e^{i\omega_k T},\theta)\,U(\omega_k)\right|^2
\frac{1}{\left|H(e^{i\omega_k T},\theta)\right|^2}
\end{aligned} \tag{7.138}
$$

(replacing $e^{i\omega_k T}$ by $i\omega_k$ for the continuous-time model (7.134)).
Here T is the sampling interval.

If H in fact does not depend on $\theta$ (the case of fixed or known noise model)
experience shows that (7.138) works well. Otherwise the estimate $\hat\theta_N$ may
not be consistent.

To find a better estimator we turn to the maximum likelihood (ML) method
for advice. We give the expressions for the continuous time case; in the case of a
discrete time model, just replace $i\omega_k$ by $e^{i\omega_k T}$. We will also be
somewhat heuristic with the treatment of white noise.

If the data were generated by

$$
y(t) = G(p,\theta)\,u(t) + H(p,\theta)\,e(t)
$$
the Fourier transforms would be related by
$$
Y(\omega) = G(i\omega,\theta)\,U(\omega) + H(i\omega,\theta)\,E(\omega) \tag{7.139}
$$

To be true, (7.139) should in many cases contain an error term that accounts for finite
time effects and the fact that the measured data $Y(\omega_k)$ often are not exact
realizations of (7.137). For periodic signals, observed over an integer number of
periods, (7.139) may however hold exactly for the input-output relation between u and
y.

Now, if e(t) is white noise, its Fourier transform (7.137) will have a complex
Normal distribution (see (I.14)):

$$
E(\omega) \in N_c(0, \lambda) \tag{7.140}
$$

This means that the real and imaginary parts are each normally distributed, with
zero means and variances $\lambda/2$. The real and imaginary parts are independent and,
moreover, $E(\omega_1)$ and $E(\omega_2)$ are independent for
$\omega_1 \ne \omega_2$. (For finite time there will remain some correlation for
neighboring frequencies, which we will ignore here.)
This implies that
$$
\begin{aligned}
& Y(\omega_k) \in N_c\big(G(i\omega_k,\theta)\,U(\omega_k),\ \lambda\,|H(i\omega_k,\theta)|^2\big) \\
& Y(\omega_k) \text{ and } Y(\omega_\ell) \text{ independent for } \omega_k \ne \omega_\ell
\end{aligned} \tag{7.141}
$$

according to the model, so that the negative logarithm of the likelihood function
becomes

$$
V_N(\theta) = N\log\lambda + \sum_{k=1}^{N} 2\log|H(i\omega_k,\theta)|
+ \frac{1}{\lambda}\sum_{k=1}^{N} \frac{\left|Y(\omega_k) - G(i\omega_k,\theta)\,U(\omega_k)\right|^2}{|H(i\omega_k,\theta)|^2} \tag{7.142}
$$
The Maximum Likelihood estimate is
$$
\hat\theta_N = \arg\min V_N(\theta) \tag{7.143}
$$


Remark: This is the ML criterion under the assumption (7.141). We noted 
above that the data might not be exactly subject to this condition, due to finite time 
effects when forming the Fourier transforms. It still makes sense to use the criterion 
(7.143), though. 


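If we perform analytical minimization of (7.143) w.r.t. $\lambda$, we obtain

$$
\hat\theta_N = \arg\min_\theta \left[N\log W_N(\theta) + 2\sum_{k=1}^{N} \log|H(i\omega_k,\theta)|\right] \tag{7.144}
$$
$$
W_N(\theta) = \frac{1}{N}\sum_{k=1}^{N} \frac{\left|Y(\omega_k) - G(i\omega_k,\theta)\,U(\omega_k)\right|^2}{|H(i\omega_k,\theta)|^2} \tag{7.145}
$$
$$
\hat\lambda_N = W_N(\hat\theta_N) \tag{7.146}
$$
Compared to (7.138) we thus have an extra term

$$
\sum_{k=1}^{N} \log|H(i\omega_k,\theta)|^2 \tag{7.147}
$$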


If the noise model is given and fixed, $H$ does not depend on $\theta$, and the term (7.147)
does not affect the estimate. This case of fixed noise models is very common in
applications with frequency domain data (see Schoukens and Pintelon, 1991). One
reason is that for a periodic input, we can obtain reasonable estimates of $H$ in a
preprocessing step. See Schoukens et al. (1997) and (7.154) below.
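
As an illustration, the concentrated criterion (7.144) with $W_N(\theta)$ from (7.145) can be evaluated as below. This is only a sketch: the function name is hypothetical, and the model responses $G$ and $H$ are assumed to be supplied as arrays already sampled at the data frequencies.

```python
import numpy as np

def concentrated_ml_criterion(Y, U, G, H):
    """Evaluate (7.144) and (7.145) for model responses G, H sampled at the N data frequencies.

    Returns (criterion value, W_N). At the minimizing parameter, W_N is the
    noise variance estimate (7.146).
    """
    Y, U, G, H = (np.asarray(x, dtype=complex) for x in (Y, U, G, H))
    n = Y.size
    w_n = np.mean(np.abs(Y - G * U) ** 2 / np.abs(H) ** 2)        # (7.145)
    value = n * np.log(w_n) + 2.0 * np.sum(np.log(np.abs(H)))     # (7.144)
    return value, w_n

# Illustrative data: a first-order response and a constant residual of magnitude 2.
w = np.linspace(0.1, 10.0, 50)
U = np.ones_like(w, dtype=complex)
G = 1.0 / (1j * w + 1.0)
H = np.ones_like(w, dtype=complex)      # fixed noise model: the log|H| term vanishes
Y = G * U + 2.0
value, w_n = concentrated_ml_criterion(Y, U, G, H)
print(value, w_n)
```

With a fixed noise model the second term is a constant, so minimizing the criterion amounts to minimizing $W_N(\theta)$ alone.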
We may also note that for any monic, stable, and inversely stable transfer
function $H(q,\theta)$ we have

$$\int_{-\pi}^{\pi} \log|H(e^{i\omega},\theta)|\, d\omega = 0 \tag{7.148}$$

The expression will also hold if the integral is replaced by summation over the
frequencies $2\pi k/N$, $k = 1, \ldots, N$. This is the reason why (7.147) is missing from time
domain criteria, like (7.25), which correspond to equally spaced frequencies $\omega_k$.
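
The summation version of (7.148) is easy to verify numerically. The sketch below assumes a rational $H$ given by coefficient lists in powers of $q^{-1}$ (a hypothetical representation chosen for the example). For finite $N$ the sum is zero only up to a term that decays geometrically in $N$, which is far below rounding error here.

```python
import numpy as np

def log_gain_sum(num, den, n_freq):
    """Sum of log|H(e^{iw_k})| over w_k = 2*pi*k/N, k = 1..N, with H = num/den in powers of q^{-1}."""
    w = 2.0 * np.pi * np.arange(1, n_freq + 1) / n_freq
    z_inv = np.exp(-1j * w)
    # np.polyval expects the highest power first, so reverse the q^{-1} coefficient lists.
    h = np.polyval(num[::-1], z_inv) / np.polyval(den[::-1], z_inv)
    return float(np.sum(np.log(np.abs(h))))

# Monic, stable, inversely stable: H(q) = (1 + 0.5 q^-1) / (1 - 0.3 q^-1)
monic_sum = log_gain_sum(num=[1.0, 0.5], den=[1.0, -0.3], n_freq=256)
# Same filter scaled by 2, hence not monic: the sum becomes N * log 2.
non_monic_sum = log_gain_sum(num=[2.0, 1.0], den=[1.0, -0.3], n_freq=256)
print(monic_sum, non_monic_sum)
```

The second call shows why monicity matters: a gain factor in $H$ contributes $N$ times its logarithm to the sum.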
Note that $W_N(\theta)$ in (7.144) can be rewritten as

$$W_N(\theta) = \frac{1}{N}\sum_{k=1}^{N} |\hat G(i\omega_k) - G(i\omega_k,\theta)|^2\, \frac{|U(\omega_k)|^2}{|H(i\omega_k,\theta)|^2} \tag{7.149}$$

in formal agreement with (7.25). Here $\hat G$ is the empirical transfer function estimate,
ETFE, defined in (6.24).


Some Variants of the Criterion 


Weighted Nonlinear Least Squares Criterion. Given an estimate $\hat G$ of the frequency
function (the ETFE or anything else), it is natural to fit a parametric model to it by
a (non-linear) least squares criterion

$$V_N(\theta) = \frac{1}{N}\sum_{k=1}^{N} |\hat G(i\omega_k) - G(i\omega_k,\theta)|^2\, W_k \tag{7.150}$$


with some weighting function $W_k$. We see that this corresponds to the ML criterion
with $W_k = |U(\omega_k)|^2 / |H(i\omega_k,\theta)|^2$. In other words, (7.150) can be interpreted as
the ML criterion with a fixed noise model

$$|H(i\omega_k)|^2 = \frac{|U(\omega_k)|^2}{W_k} \tag{7.151}$$

The numerical minimization of this criterion is typically carried out using a damped 
Gauss-Newton method, like for most of the other criteria discussed in this book. See 
Section 10.2. 


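A Linear Method. If we use an ARX-parameterization of the model (see (4.9))

$$G(p,\theta) = \frac{B(p)}{A(p)}, \qquad H(p,\theta) = \frac{1}{A(p)}$$

then the criterion (7.149) takes the form

$$W_N(\theta) = \frac{1}{N}\sum_{k=1}^{N} |A(i\omega_k)\hat G(i\omega_k) - B(i\omega_k)|^2\, |U(\omega_k)|^2 \tag{7.152}$$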

This is a quadratic criterion in the coefficients of the polynomials A and B. It can 
therefore be minimized explicitly by the least squares solution. 
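
A minimal sketch of this explicit least squares solution follows. The polynomial parameterization (monic $A$, stated in the docstring), the function name, and the test system are assumptions made for the example. Since the coefficients are real while the equations are complex, real and imaginary parts are stacked before solving.

```python
import numpy as np

def fit_arx_frequency_domain(w, G_hat, U, na, nb):
    """Least squares fit of A(iw) G_hat(iw) ~ B(iw), weighted by |U(w)|, as in (7.152).

    Assumed parameterization (hypothetical, for illustration):
      A(s) = s^na + a_1 s^(na-1) + ... + a_na,   B(s) = b_1 s^(nb-1) + ... + b_nb.
    Returns the real coefficient vectors (a, b).
    """
    s = 1j * np.asarray(w, dtype=float)
    G_hat = np.asarray(G_hat, dtype=complex)
    weight = np.abs(np.asarray(U))
    # Columns multiplying a_1..a_na and b_1..b_nb in A*G_hat - B.
    cols_a = [s ** (na - j) * G_hat for j in range(1, na + 1)]
    cols_b = [-(s ** (nb - j)) for j in range(1, nb + 1)]
    Phi = np.column_stack(cols_a + cols_b) * weight[:, None]
    target = -(s ** na) * G_hat * weight
    # Real parameters: stack real and imaginary parts of the complex equations.
    Phi_real = np.vstack([Phi.real, Phi.imag])
    target_real = np.concatenate([target.real, target.imag])
    theta, *_ = np.linalg.lstsq(Phi_real, target_real, rcond=None)
    return theta[:na], theta[na:]

# Noise-free check on G(s) = 2 / (s + 3), i.e. a_1 = 3 and b_1 = 2.
w = np.logspace(-1, 1, 20)
G_true = 2.0 / (1j * w + 3.0)
a, b = fit_arx_frequency_domain(w, G_true, np.ones_like(w), na=1, nb=1)
print(a, b)
```

With noisy data the implied weighting by $|A(i\omega_k)|^2$ emphasizes high frequencies, which is the price paid for linearity.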
A number of other variants can also be defined. We can, for example, define a
frequency domain IV-method in analogy with (7.132). See Pintelon et al. (1994) for
a more complete survey.


Some Practical Features with Estimation from Frequency Domain Data

There are several distinct features with the direct frequency domain approach that 
could be quite useful. We shall list a few: 


• Prefiltering is known as quite useful in the time-domain approach. See Section
14.4. For frequency domain data it becomes very simple: It just corresponds
to assigning different weights to different frequencies in the weighted criterion
(7.150). This, in turn, is the same as invoking a special noise model (7.151).

Normally, it does not quite make sense to combine prefiltering with estimating
a noise model, since a parameter-dependent weighting as in

$$V_N(\theta) = \frac{1}{N}\sum_{k=1}^{N} |\hat G(i\omega_k) - G(i\omega_k,\theta)|^2\, \frac{W_k\,|U(\omega_k)|^2}{|H(i\omega_k,\theta)|^2}$$

may undo any applied weighting from $W$.

• Condensing Large Data Sets. When dealing with systems with a fairly wide
spread of time constants, large data sets have to be collected in the time domain.
When converted to the frequency domain they can easily be condensed, so
that, for example, logarithmically spaced frequencies are obtained. At higher
frequencies one would thus decimate the data, which involves averaging over
neighboring frequencies. Then the noise level ($\lambda_k$) is reduced accordingly.

• Combining Experiments. Nothing in the approach of (7.141)-(7.143) says that
the frequency response data at different frequencies have to come from the
same experiment, or even that the frequencies involved ($\omega_k$, $k = 1, \ldots, N$)
all have to be different. It is thus very easy to combine data from different
experiments.

• Periodic Inputs. The main drawback with the frequency domain approach is
that the underlying frequency domain model (7.139) is strictly correct only for
a periodic input and assuming all transients have died out. On the other hand,
typical use of the time domain method assumes inputs and outputs prior to
time $t = 0$ to be zero. Whichever assumption about past behavior is closer to
the truth should thus affect the choice of approach. Note, though, that both
the time-domain and the frequency-domain methods allow the possibility to
estimate a finite number of parameters that pick up these transients, and thus
give correct handling of these effects. See also Section 13.3.

• Non-Parametric Noise Estimates from Periodic Inputs. We have for the true
system $y(t) = y_u(t) + v(t)$, where $y_u(t) = G_0(q)u(t)$. If $u(t)$ is periodic with
period $M$, so will $y_u(t)$ be, after a transient. By averaging the output over $K$
periods,

$$\bar y(t) = \frac{1}{K}\sum_{k=0}^{K-1} y(t + kM), \qquad t = 1, \ldots, M \tag{7.153}$$

we can thus get a better estimate of $y_u(t)$, and also estimate the noise sequence
as

$$\hat v(t) = y(t) - \bar y(t) \tag{7.154}$$

where the definition of $\bar y(t)$ has been extended to $1 \le t \le N$ by periodic
continuation. Estimating the spectrum of $\hat v(t)$ with any (non-parametric) method
gives a noise spectral model $|H(i\omega)|^2$ that can be used in (7.144).


• Band-Limited Signals. If the actual input signals are band-limited (like no
power above the Nyquist frequency), the continuous time Fourier transform
(7.137) can be well computed from sampled data. It is then possible to directly
build continuous-time models without any extra work. Notice also that
frequency contents above the Nyquist frequency can be eliminated from both
input and output signals by anti-alias filtering (see Section 13.7) before sampling.
Such filtering will not distort the input-output relationship, provided the
input and output are subjected to exactly the same filters.


• Continuous-Time Models. The comment above shows that direct continuous-
time system identification from “continuous-time data” can be dealt with
in a rather straightforward fashion. Otherwise, continuous time data with
continuous-time white noise descriptions are delicate mathematical objects.


• Trade-off Noise/Frequency Resolution. The approach also allows for a more
direct and frequency dependent trade-off between frequency resolution and
noise levels. That will be done as the original Fourier transform data are decimated
to the selected range of frequencies $\omega_k$, $k = 1, \ldots, N$.
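
The period averaging (7.153) and the noise estimate (7.154) from the list above can be sketched as follows. The function name and the simulated signal are illustrative assumptions; the record is assumed to hold a whole number of periods collected after the transient.

```python
import numpy as np

def average_over_periods(y, period):
    """Return (y_bar, v_hat): the period average (7.153) and the noise estimate (7.154).

    y must contain a whole number K of periods of length `period`,
    recorded after transients have died out.
    """
    y = np.asarray(y, dtype=float)
    n_periods, remainder = divmod(y.size, period)
    if remainder != 0 or n_periods == 0:
        raise ValueError("y must hold a whole number of periods")
    y_bar = y.reshape(n_periods, period).mean(axis=0)    # (7.153)
    v_hat = y - np.tile(y_bar, n_periods)                # (7.154), periodic continuation
    return y_bar, v_hat

# Illustrative data: a sinusoidal periodic response plus white noise of variance 0.01.
rng = np.random.default_rng(1)
M, K = 20, 50
y_u = np.sin(2.0 * np.pi * np.arange(M * K) / M)
y = y_u + 0.1 * rng.normal(size=M * K)
y_bar, v_hat = average_over_periods(y, M)
print(np.max(np.abs(y_bar - y_u[:M])), v_hat.var())
```

Averaging over $K$ periods reduces the noise variance in $\bar y(t)$ by the factor $K$, while $\hat v(t)$ retains almost the full noise variance and can be passed to any spectral estimator.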


7.8 SUMMARY


There are several ways to fit models in a given set to observed data. In this chapter we
have pointed out two general procedures. Both deal with the sequence of prediction
errors $\{\varepsilon(t,\theta)\}$ computed from the respective models using the observed data, and
both could be said to aim at making this sequence “small.”

The prediction-error identification approach (PEM) was defined by (7.10) to
(7.12):

$$\hat\theta_N = \arg\min_{\theta \in D_{\mathcal M}} V_N(\theta, Z^N)$$

$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N} \ell(\varepsilon(t,\theta), \theta, t) \tag{7.155}$$


It contains well-known procedures, such as the least-squares (LS) method and the 
maximum-likelihood (ML) method and is at the same time closely related to Bayesian 
maximum a posteriori (MAP) estimation and Akaike’s information criterion (AIC). 

The subspace approach to identifying state-space models was defined by (7.66).
It consists of three steps: (1) estimating the k-step ahead predictors using an LS-
algorithm, (2) selecting the state vector from these, and finally (3) estimating the
state-space matrices using these states and the LS-method.
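The correlation approach was defined by (7.110):

$$\varepsilon_F(t,\theta) = L(q)\varepsilon(t,\theta)$$

$$\hat\theta_N = \operatorname{sol}_{\theta \in D_{\mathcal M}} \big[ f_N(\theta, Z^N) = 0 \big]$$

$$f_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N} \zeta(t,\theta)\,\alpha(\varepsilon_F(t,\theta)) \tag{7.156}$$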


It contains the instrumental-variable (IV) technique, as well as several methods for 
rational transfer function models. 

System identification has often been described as an area crowded with seemingly
unrelated ad hoc methods and tricks. The list of names of available and suggested
methods is no doubt a very long one. It is our purpose, however, with this
chapter, as well as Chapters 8 to 11, to point out that the number of underlying basic
ideas is really quite small, and that it indeed is quite possible to orient oneself in the
area of system identification with these basic ideas as a starting point.

It might be added that for systems operating in closed loop some special iden- 
tification techniques have been devised. We shall review these methods in Section 
13.5, in connection with a discussion of the closed loop experiment situation. The 
bottom line is that a direct application of the prediction error methods of this chapter 
should be the prime choice, also for closed loop data. 
7.10 PROBLEMS

7G.1 Input error and output error methods: Consider a model structure

$$y(t) = G(q,\theta)u(t)$$

without a specified noise model. In the survey of Åström and Eykhoff (1971), identification
methods that minimize “the output error”

$$\hat\theta_N = \arg\min_\theta \sum_{t=1}^{N} [y(t) - G(q,\theta)u(t)]^2$$

and the “input error”

$$\hat\theta_N = \arg\min_\theta \sum_{t=1}^{N} [u(t) - G^{-1}(q,\theta)y(t)]^2$$

are listed. Show that these methods are prediction error methods corresponding to
particular choices of noise models $H(q,\theta)$.
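7G.2 Spectral analysis as a prediction error method: Consider the model structure

$$G(e^{i\omega},\theta) = \sum_{k=1}^{n} (g_k^R + i g_k^I)\, W_\gamma(\omega - \omega_k)$$

and let $H(e^{i\omega},\eta)$ be an arbitrary noise model parametrization. Let $\hat\theta_N$ be the prediction-
error estimate obtained by minimization of (7.23) and (7.25):

$$\hat\theta_N = \arg\min_\theta \int_{-\pi}^{\pi} |\hat G_N(e^{i\omega}) - G(e^{i\omega},\theta)|^2\, \frac{|U_N(\omega)|^2}{|H(e^{i\omega},\eta)|^2}\, d\omega$$

(a) Consider the special case $H(e^{i\omega},\eta) \equiv 1$ and

$$W_\gamma(\omega) = \begin{cases} 1, & |\omega| < \dfrac{\pi}{2n} \\[4pt] 0, & |\omega| > \dfrac{\pi}{2n} \end{cases} \qquad \omega_k = \frac{(k-1)\pi}{n}$$

Show that $G(e^{i\omega_k}, \hat\theta_N)$ is then given by (6.46).
(b) Assume, in the general case, that

$$H(e^{i\omega},\eta) \cdot W_\gamma(\omega - \omega_k) \approx H(e^{i\omega_k},\eta) \cdot W_\gamma(\omega - \omega_k)$$
$$G(e^{i\omega},\theta) \cdot W_\gamma(\omega - \omega_k) \approx G(e^{i\omega_k},\theta) \cdot W_\gamma(\omega - \omega_k)$$

Show that (6.46) then holds approximately.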
7G.3 Instruments for closed-loop systems: Consider a system

$$A_0(q)y(t) = B_0(q)u(t) + v_0(t)$$

under the output feedback

$$u(t) = F_1(q)r(t) - F_2(q)y(t)$$

(a) Let $x(t)$ and $\zeta(t)$ be given by

$$N(q)x(t) = M(q)r(t)$$

$$\zeta(t) = K(q)\,[-x(t-1) \;\ldots\; -x(t-n_a) \;\; r(t-1) \;\ldots\; r(t-n_b)]^T$$

Show that (7.120) holds for these instruments, and verify that (7.119) holds for a
simple first-order special case.

(b) Suppose that $v_0(t)$ is known to be an MA process of order $s$. Introduce the
instruments

$$\zeta(t) = [-y(t-1-s) \;\ldots\; -y(t-n_a-s) \;\; u(t-1-s) \;\ldots\; u(t-n_b-s)]^T$$

Show the same results as under part (a). See also Söderström, Stoica and Trulsson
(1987).


7G.4 Suppose $Y_N = [y(1), \ldots, y(N)]^T$ is a Gaussian $N$-dimensional random vector with
zero mean and covariance matrix $R_N(\theta)$. Let

$$R_N(\theta) = L_N(\theta)\Lambda_N(\theta)L_N^T(\theta)$$

where $L_N(\theta)$ is lower triangular with 1’s along the diagonal and $\Lambda_N(\theta)$ a diagonal
matrix with $\lambda_\theta(t)$ as the $t,t$ element. Let

$$E_N(\theta) = L_N^{-1}(\theta)Y_N, \qquad E_N^T(\theta) = [\varepsilon(1,\theta), \ldots, \varepsilon(N,\theta)]$$

Show that, if $\theta$ is a parameter to be estimated, then the negative log likelihood
function when $Y_N$ is observed is

$$\frac{N}{2}\log 2\pi + \frac{1}{2}\log\det R_N(\theta) + \frac{1}{2} Y_N^T R_N^{-1}(\theta) Y_N$$

Show also that this can be rewritten as

$$\frac{N}{2}\log 2\pi + \frac{1}{2}\sum_{t=1}^{N} \log\lambda_\theta(t) + \frac{1}{2}\sum_{t=1}^{N} \frac{\varepsilon^2(t,\theta)}{\lambda_\theta(t)}$$

where $\varepsilon(t,\theta)$ are independent, normal random variables with variances $\lambda_\theta(t)$. How
does this relate to our calculations (7.81) to (7.87)?




7G.5 Let the two random vectors $X$ and $Y$ be jointly Gaussian with

$$EX = m_X, \qquad EY = m_Y$$
$$E(X - m_X)(X - m_X)^T = P_X$$
$$E(Y - m_Y)(Y - m_Y)^T = P_Y$$
$$E(X - m_X)(Y - m_Y)^T = P_{XY}$$

Show that the conditional distribution of $X$ given $Y$ is

$$(X|Y) \in N\big(m_X + P_{XY}P_Y^{-1}(Y - m_Y),\ P_X - P_{XY}P_Y^{-1}P_{XY}^T\big)$$
7G.6 Consider the model structure

$$X = F(\theta)W$$
$$Y = H(\theta)X + E \tag{7.157}$$

where $W$ and $E$ are two independent, Gaussian random vectors with zero mean values
and unit covariance matrices. Note that state-space models like (4.84), without input,
can be written in this form by forming $X^T = [x^T(1)\ x^T(2) \ldots x^T(N)]$ and $Y^T =
[y^T(1)\ y^T(2) \ldots y^T(N)]$. Let

$$R(\theta) = I + H(\theta)F(\theta)F^T(\theta)H^T(\theta)$$
Show the following: 


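(a) The negative log likelihood function for $\theta$ (ignoring $\theta$-independent terms) when
$Y$ is observed is

$$V(\theta) = -\log p(Y|\theta) = \tfrac{1}{2} Y^T R^{-1}(\theta) Y + \tfrac{1}{2}\log\det R(\theta)$$

Let

$$\hat\theta_{ML} = \arg\min_\theta V(\theta)$$

(cf. Problem 7G.4).
(b) Let the conditional expectation of $X$, given $Y$ and $\theta$, be $\hat X^s(\theta)$. Show that

$$E(X|Y,\theta) = \hat X^s(\theta) = F(\theta)F^T(\theta)H^T(\theta)R^{-1}(\theta)Y \tag{7.158}$$

and that

$$-\log p(X|\theta,Y) = \tfrac{1}{2}\big(X - \hat X^s(\theta)\big)^T S^{-1}(\theta)\big(X - \hat X^s(\theta)\big) + \tfrac{1}{2}\log\det S(\theta)$$

$$S(\theta) = F(\theta)F^T(\theta) - F(\theta)F^T(\theta)H^T(\theta)R^{-1}(\theta)H(\theta)F(\theta)F^T(\theta) \tag{7.159}$$

(cf. Problem 7G.5) [$\hat X^s(\theta)$ gives the smoothed state estimate for the underlying
state space model, see Anderson and Moore (1979)].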
(c) Assume that the prior distribution of $\theta$ is flat ($p(\theta)$ independent of $\theta$). Then
show that the joint MAP estimate (7.77) of $\theta$ and $X$ given $Y$,

$$(\hat\theta_{MAP}, \hat X_{MAP}) = \arg\max_{\theta, X} p(\theta, X|Y)$$

is given by

$$\arg\min_{X,\theta}\, [-\log p(Y, X|\theta)]$$

where

$$-\log p(Y, X|\theta) = \tfrac{1}{2}|Y - H(\theta)X|^2 + \tfrac{1}{2}|F^{-1}(\theta)X|^2 + \log\det F(\theta) \tag{7.160}$$

(d) Show that the value of $X$ that minimizes (7.160) for fixed $Y$ and $\theta$ is $\hat X^s(\theta)$,
defined by (7.158). Hence

$$\hat\theta_{MAP} = \arg\min_\theta \Big[ \tfrac{1}{2}\big|Y - H(\theta)\hat X^s(\theta)\big|^2 + \tfrac{1}{2}\big|F^{-1}(\theta)\hat X^s(\theta)\big|^2 + \log\det F(\theta) \Big]$$

$$\hat X_{MAP} = \hat X^s(\hat\theta_{MAP})$$
(e) Establish that

$$-\log p(Y|\theta) = -\log p(Y, X|\theta) + \log p(X|\theta, Y) \tag{7.161}$$

(f) Establish that

$$-\log p(Y|\theta) = \tfrac{1}{2}\big|Y - H(\theta)\hat X^s(\theta)\big|^2 + \tfrac{1}{2}\big|F^{-1}(\theta)\hat X^s(\theta)\big|^2 + \tfrac{1}{2}\log\det R(\theta) \tag{7.162}$$

[Hint: Use the matrix identity (cf. (7.159))

$$S(\theta) = \big[F^{-T}(\theta)F^{-1}(\theta) + H^T(\theta)H(\theta)\big]^{-1}$$

and the determinant identity

$$\det(I_r + AB) = \det(I_s + BA)$$

for $A$ and $B$ being $r \times s$ and $s \times r$ matrices and $I_r$ the $r \times r$ identity matrix.]
(g) Conclude that $\hat\theta_{ML} \neq \hat\theta_{MAP}$ in general.

Remark: The problem illustrates the relationships among various expressions
for the likelihood function, the smoothing problem, and MAP-estimates.
“Log likelihood functions” of the kind (7.160) have been discussed, e.g., in Sage
and Melsa (1971) and Schweppe (1973), Section 14.3.2.


7G.7 Consider the linear regression structure

$$y(t) = \varphi^T(t)\theta + v(t)$$

Based on the theory of optimal algorithms for operator approximation (Traub and
Wozniakowski, 1980), Milanese and Tempo (1985), and Milanese, Tempo, and Vicino
(1986) have suggested the following estimate:
For given $\delta$, $y^N$ and $\{\varphi(t)\}$, define the set

$$A_\delta = \{\theta : |y(t) - \varphi^T(t)\theta| \le \delta,\ t = 1, \ldots, N\}$$

Assuming $A_\delta$ to be bounded and non-empty, define its “center” $\theta_c(A_\delta)$ as follows:
The $i$th component is

$$[\theta_c(A_\delta)]^{(i)} = \tfrac{1}{2}\Big[\sup_{\theta \in A_\delta} \theta^{(i)} + \inf_{\theta \in A_\delta} \theta^{(i)}\Big]$$

(superscript $(i)$ denoting $i$:th component). The estimate $\hat\theta_\delta$ is then taken as $\theta_c(A_\delta)$.

(a) Suppose that $\dim\theta = 1$. Prove that $\hat\theta_\delta$ is independent of $\delta$, as long as $A_\delta$ is
nonempty and bounded.

(b) When $\dim\theta > 1$, $\hat\theta_\delta$ may in general depend on $\delta$. Suppose that as $\delta$ decreases
to a value $\delta^*$, $A_\delta$ reduces to a singleton

$$A_{\delta^*} = \{\theta^*\}$$

Then clearly $\hat\theta_{\delta^*} = \theta^*$. Show that

$$\theta^* = \arg\min_\theta \max_t |y(t) - \varphi^T(t)\theta|$$

This “optimal estimate” thus corresponds to the prediction error estimate (7.12)
with the $\ell_\infty$-norm

$$\ell_\infty(\varepsilon^N(\cdot,\theta)) = \max_t |\varepsilon(t,\theta)|$$

This in turn can be seen as the limit as $p \to \infty$ of the criterion functions

$$\ell(\varepsilon) = |\varepsilon|^p$$

in (7.11).
7E.1 Estimating the AR Part of an ARMA model: Consider the ARMA model

$$A(q)y(t) = C(q)e(t)$$

with orders $n_a$ and $n_c$, respectively. A method to estimate the AR part has been given
as follows. Let

$$\hat R_N^y(\tau) = \frac{1}{N}\sum_{t=1}^{N} y(t)y(t-\tau)$$

Then solve for $\hat a_k$ from

$$\hat R_N^y(\tau) + a_1 \hat R_N^y(\tau-1) + \cdots + a_{n_a} \hat R_N^y(\tau-n_a) = 0, \qquad \tau = n_c+1,\ n_c+2,\ \ldots,\ n_c+n_a$$

Show that this (essentially) is an application of the IV method using specific instruments.
Which ones? (See Cadzow, 1980, and Stoica, Söderström, and Friedlander, 1985.)


7E.2 Sinusoids in noise: Consider a sinusoid measured in white Gaussian noise:

$$y(t) = \alpha e^{i\omega t} + e(t)$$

For simplicity we use complex algebra. The constant $\alpha$ is thus complex-valued. The
amplitude, phase and frequency are unknown: $\theta = (\alpha, \omega)$. The predictor thus is

$$\hat y(t|\theta) = \alpha e^{i\omega t}$$

If $e(t)$ has variance 1 (real and imaginary parts independent), the likelihood function
gives the prediction-error criterion:

$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N} |y(t) - \hat y(t|\theta)|^2$$

Show that the MLE

$$\hat\theta_N = (\hat\alpha_N, \hat\omega_N) = \arg\min_\theta V_N(\theta, Z^N)$$

obeys

$$\hat\omega_N = \arg\max_\omega |Y_N(\omega)|^2$$

where $Y_N(\omega)$ is the Fourier transform (2.37) of $y(t)$.


7E.3 Error-in-variables models: Econometric models often include disturbances both on
inputs and outputs (compare our comment in Section 2.1 on Figure 2.2). Consider the
model in Figure 7.1. The true inputs and outputs are thus $s$ and $x$, while we measure $u$
and $y$. In a first-order case, we have

$$x(t) + a x(t-1) = b s(t-1)$$
$$y(t) = x(t) + e(t)$$
$$u(t) = s(t) + w(t)$$

Suppose that $w$ and $e$ are independent white noises with unknown variances.
Discuss how $a$, $b$, and these variances can be estimated using measurements of $y$ and $u$.

Figure 7.1 An error-in-variables model.

[Remark: With the assumption that the colors of the noises are known, the problem
is relatively simple. Without this assumption the problem is more difficult. See
Kalman (1991), Anderson (1985), Söderström (1981), McKelvey (1996), and Stoica
et al. (1997).]


7E.4 Consider a probabilistic model, implicitly given in the state-space form

$$x(t+1) = a x(t) + w(t)$$
$$y(t) = x(t) + v(t) \tag{7.163a}$$

where $\{w(t)\}$ and $\{v(t)\}$ are assumed to be independent, white Gaussian noises, with
variances

$$E w^2(t) = r_1$$
$$E v^2(t) = 1 \quad \text{(assumed known)} \tag{7.163b}$$

Let the parameter vector be

$$\theta = \begin{bmatrix} a \\ r_1 \end{bmatrix} \tag{7.163c}$$

Assume initial conditions for $x(0)$ (mean and variance) such that the prediction $\hat y(t|\theta)$
becomes a stationary process for each $\theta$ (i.e., so that the steady-state Kalman filter can
be used). Determine the log-likelihood function for this problem. Compare with the
log-likelihood function for a directly parametrized innovations representation model
(4.91).


7E.5 Consider the nonlinear model structure of Problem 5E.1. Discuss how the LS, ML, IV,
and PLR methods can be applied to this structure. (Reference: Fnaiech and Ljung,
1986).


7E.6 Consider the model structure

$$y(t) = \varphi^T(t)\theta + v(t)$$

where the regression vector $\varphi(t)$ can only be measured with noise:

$$\eta(t) = \varphi(t) + w(t)$$

The noises $\{w(t)\}$ and $\{v(t)\}$ may be nonwhite and mutually correlated. Suppose a
vector $\zeta(t)$ is known that is uncorrelated with $\{v(t)\}$ and $\{w(t)\}$ but correlated with
$\varphi(t)$. Suggest how to estimate $\theta$ from $y(t)$, $\eta(t)$, and $\zeta(t)$, $t = 1, \ldots, N$.


7E.7 Suppose in (7.86) and (7.87) that $\lambda$ does not depend on $\theta$. Determine $\hat\lambda_N$.

7E.8 Consider the model structure

$$\hat y(t|\theta) = -a y(t-1) + b u(t-1)$$

and assume that the true system is given by

$$y(t) - 0.9y(t-1) = u(t-1) + e_0(t)$$

where $\{e_0(t)\}$ is white noise of unit variance. Determine the Cramér-Rao bound for
the estimation of $a$ and $b$. How does it depend on the properties of $u$?
7E.9 Suppose that $u(t)$ is periodic with period $M$, and that all transients have died out. We
collect data over $K$ periods: $\{y(t), u(t)\}$, $t = 1, \ldots, KM$. We take the DFT of the
signals and form (7.144) for a fixed noise model $H_*(i\omega)$. Show that this gives exactly
the same results as if we just take the DFT over one period and use the averaged output
(7.153). Is it essential that the noise model is fixed?


7T.1 Suppose that a true description of a certain system is given by

$$y(t) + a_1^0 y(t-1) + \cdots + a_{n_a}^0 y(t-n_a) = b_1^0 u(t-1) + \cdots + b_{n_b}^0 u(t-n_b) + v_0(t)$$

for a stationary process $\{v_0(t)\}$ independent of the input. Let $\varphi(t)$ be defined, as usual,
by (7.32), and let $\varphi_0(t)$ be given by

$$\varphi_0(t) = [-y_0(t-1) \;\ldots\; -y_0(t-n_a) \;\; u(t-1) \;\ldots\; u(t-n_b)]^T$$

where

$$y_0(t) + a_1^0 y_0(t-1) + \cdots + a_{n_a}^0 y_0(t-n_a) = b_1^0 u(t-1) + \cdots + b_{n_b}^0 u(t-n_b)$$

Prove that for any vector of instrumental variables $\zeta(t)$ of the general kind (7.122) we have

$$\bar E \zeta(t)\varphi^T(t) = \bar E \zeta(t)\varphi_0^T(t)$$


7D.1 Consider the ARX structure (4.7) where one parameter, say $b_1$, is known to have a
certain value $b_1^0$. Show that the associated predictor can be written as

$$\hat y(t|\theta) = \theta^T\varphi(t) + \mu(t)$$

with proper definitions of $\theta$, $\varphi$, and $\mu$ ($\varphi$ and $\mu$ to be known variables at time $t$).
Derive the LS estimate and the IV estimate for this model.
7D.2 Let $A$ be a given, positive definite symmetric matrix and let $B$ and $C$ be given matrices.
Establish that

$$\theta^T A\theta - \theta^T B - B^T\theta + C = [\theta - A^{-1}B]^T A\,[\theta - A^{-1}B] + C - B^T A^{-1} B \ \ge\ C - B^T A^{-1} B$$

and use this result to prove all the expressions for the LSE in Section 7.3 [(7.34), (7.41),
(7.43), and (7.46)]. The matrix inequality $D \ge B$ is to be interpreted as “$D - B$ is a
positive semidefinite matrix.”

Hint: For (7.46). rewrite (7.45) as 


k 
y ly 
V0.2") = u5 > Lye) - 6 pin] [re -Tg 


t=] 


7D.3 Let $\Sigma$ be an invertible square $p \times p$ matrix with elements $\sigma_{ij}$. Prove the differentiation
formula

$$\frac{\partial}{\partial\sigma_{ij}} \det\Sigma = \det[\Sigma] \cdot \mu_{ji}$$

where $\mu_{ji}$ is the $j,i$ element of $\Sigma^{-1}$. [Hint: Use $\det(I + \varepsilon A) = 1 + \varepsilon \operatorname{tr} A +$ higher-
order terms in $\varepsilon$]. Use the result to prove (7.94) and (7.95).

7D.4 Show that the two instrumental variable vectors, of dimension $d$, $\zeta_1(t)$ and $\zeta_2(t)$, where
$\zeta_1(t) = T\zeta_2(t)$ with $T$ invertible, give the same estimate $\hat\theta_N$ in (7.118).
7D.5 Show that if two variables $x$ and $u$ are associated as in (7.123) and (7.124) then we can
write

$$\begin{bmatrix} -x(t-1) \\ \vdots \\ -x(t-n_n) \\ u(t-1) \\ \vdots \\ u(t-n_m) \end{bmatrix} = S(-M, N)\, \frac{1}{N(q)} \begin{bmatrix} u(t-1) \\ u(t-2) \\ \vdots \\ u(t-n_n-n_m) \end{bmatrix}$$

for an $(n_n + n_m) \times (n_n + n_m)$ matrix

$$S(-M, N) = \begin{bmatrix}
-m_0 & -m_1 & \cdots & -m_{n_m} & 0 & \cdots & 0 \\
0 & -m_0 & \cdots & -m_{n_m-1} & -m_{n_m} & \cdots & 0 \\
\vdots & & \ddots & & & \ddots & \vdots \\
0 & 0 & \cdots & -m_0 & -m_1 & \cdots & -m_{n_m} \\
1 & n_1 & \cdots & n_{n_n} & 0 & \cdots & 0 \\
0 & 1 & \cdots & n_{n_n-1} & n_{n_n} & \cdots & 0 \\
\vdots & & \ddots & & & \ddots & \vdots \\
0 & 0 & \cdots & 1 & n_1 & \cdots & n_{n_n}
\end{bmatrix}$$

Such a matrix is called a Sylvester matrix (see, e.g., Kailath, 1980), and it will be
nonsingular if and only if the polynomials in (7.124) have no common factor. Use this
result to prove that the instruments (7.126) give the same IV estimate as the instruments
(7.122). Reference: Söderström and Stoica (1983).

7D.6 Show that the prediction-error estimate obtained from (7.11) and (7.12) can also be
seen as a correlation estimate (7.110) for a particular choice of $L$, $\zeta$, and $\alpha$.

7D.7 Give an explicit expression for the estimate in (7.130) in the case $\zeta$ does not depend
on $\theta$, and $\alpha(\varepsilon) = \varepsilon$.

7D.8 Consider the symmetric matrix

$$H = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix}$$

Show that if $H \ge 0$, then

$$A - BC^{-1}B^T \ge 0.$$

Hint: Consider $xHx^T$ for

$$x = [x_1 \quad -x_1 B C^{-1}]$$

with $x_1$ arbitrary.
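APPENDIX 7A: PROOF OF THE CRAMÉR-RAO INEQUALITY

The assumption $E\hat\theta(y^N) = \theta_0$ can be written

$$\theta_0 = \int_{R^N} \hat\theta(x^N) f_y(\theta_0; x^N)\, dx^N \tag{7A.1}$$

By definition we also have

$$1 = \int_{R^N} f_y(\theta_0; x^N)\, dx^N \tag{7A.2}$$

Differentiating these two expressions with respect to $\theta_0$ gives

$$I = \int_{R^N} \hat\theta(x^N) \Big[\frac{d}{d\theta_0} f_y(\theta_0; x^N)\Big]^T dx^N
= \int_{R^N} \hat\theta(x^N) \Big[\frac{d}{d\theta_0} \log f_y(\theta_0; x^N)\Big]^T f_y(\theta_0; x^N)\, dx^N
= E\,\hat\theta(y^N) \Big[\frac{d}{d\theta_0} \log f_y(\theta_0; y^N)\Big]^T \tag{7A.3}$$

($I$ is the $d \times d$ unit matrix) and

$$0 = \int_{R^N} \Big[\frac{d}{d\theta_0} f_y(\theta_0; x^N)\Big]^T dx^N
= \int_{R^N} \Big[\frac{d}{d\theta_0} \log f_y(\theta_0; x^N)\Big]^T f_y(\theta_0; x^N)\, dx^N
= E \Big[\frac{d}{d\theta_0} \log f_y(\theta_0; y^N)\Big]^T \tag{7A.4}$$

Expectation in these two expressions is hence w.r.t. $y^N$.
Now multiply (7A.4) by $\theta_0$ and subtract it from (7A.3). This gives

$$E\,[\hat\theta(y^N) - \theta_0] \Big[\frac{d}{d\theta_0} \log f_y(\theta_0; y^N)\Big]^T = I \tag{7A.5}$$

Now denote

$$\alpha = \hat\theta(y^N) - \theta_0, \qquad \beta = \frac{d}{d\theta_0} \log f_y(\theta_0; y^N) \tag{7A.6}$$

(both $d$-dimensional column vectors) so that

$$E\alpha\beta^T = I \tag{7A.7}$$

Hence

$$E \begin{bmatrix} \alpha \\ \beta \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}^T = \begin{bmatrix} E\alpha\alpha^T & I \\ I & E\beta\beta^T \end{bmatrix} \ge 0$$

where the positive semidefiniteness follows by construction. Hence Problem 7D.8
proves that

$$E\alpha\alpha^T \ge [E\beta\beta^T]^{-1}$$

which is (7.79). It only remains to prove the equality in (7.80). Differentiating the
transpose of (7A.4) gives

$$0 = \int_{R^N} \Big[\frac{d^2}{d\theta_0^2} \log f_y(\theta_0; x^N)\Big] f_y(\theta_0; x^N)\, dx^N
+ \int_{R^N} \Big[\frac{d}{d\theta_0} \log f_y(\theta_0; x^N)\Big] \Big[\frac{d}{d\theta_0} \log f_y(\theta_0; x^N)\Big]^T f_y(\theta_0; x^N)\, dx^N$$

which gives (7.80).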
CONVERGENCE AND CONSISTENCY


8.1 INTRODUCTION 


In Chapter 7 we described a number of different methods to determine models from 
data. To use these methods in practice, we need insight into their properties: How 
well will the identified model describe the actual system? Are some identification 
methods better than others? How should the design variables associated with a 
certain method be chosen? 


Such questions relate, from a formal point of view, to the mapping (7.7) from
the data set $Z^N$ to the parameter estimate $\hat\theta_N$:

$$Z^N \to \hat\theta_N \in D_{\mathcal M} \tag{8.1}$$


Questions about properties of this mapping can be answered basically in two ways: 


1. Generate data $Z^N$ with known characteristics. Apply the mapping (8.1) (corresponding
to a particular identification method) and evaluate the properties
of $\hat\theta_N$. This is known as simulation studies.

2. Assume certain properties of $Z^N$ and try to calculate what the inherited properties
of $\hat\theta_N$ are. This is known as analysis.
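
A simulation study of the first kind can be sketched as follows. The example is purely illustrative: the first-order ARX system, its parameter values, the white-noise input, and the use of the least squares method are all assumptions chosen to keep the code short.

```python
import numpy as np

def simulate_arx(n, a, b, noise_std, rng):
    """Generate Z^N from y(t) + a*y(t-1) = b*u(t-1) + e(t) with white Gaussian u and e."""
    u = rng.normal(size=n)
    e = noise_std * rng.normal(size=n)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = -a * y[t - 1] + b * u[t - 1] + e[t]
    return u, y

def ls_estimate(u, y):
    """The mapping Z^N -> theta_hat_N for least squares in the same ARX structure."""
    Phi = np.column_stack([-y[:-1], u[:-1]])
    theta, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
    return theta

def mean_squared_error(n, n_runs=200, seed=0):
    """Monte Carlo estimate of E|theta_hat_N - theta_0|^2 for data length n."""
    rng = np.random.default_rng(seed)
    true_theta = np.array([-0.9, 1.0])      # hypothetical "known characteristics"
    errors = [ls_estimate(*simulate_arx(n, *true_theta, 1.0, rng)) - true_theta
              for _ in range(n_runs)]
    return float(np.mean(np.sum(np.square(errors), axis=1)))

mse_small = mean_squared_error(n=100)
mse_large = mean_squared_error(n=5000)
print(mse_small, mse_large)
```

In this setup the mean squared error should shrink roughly in proportion to $1/N$, which is the kind of empirical observation that the analysis in this chapter and the next aims to explain.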


In this chapter we shall analyze the convergence properties of $\hat\theta_N$ as $N$ tends
to infinity. Since we will never encounter infinitely many data, such analysis has the
character of a “thought experiment,” and we must support it with some assumptions
about a corresponding infinite data set $Z^\infty$. There are some different possibilities for
such assumptions (see Problem 8T.1). Here we shall adopt a stochastic framework
for the observations, along the lines described in Chapter 2. We shall thus consider
the data as realizations of a stochastic process with deterministic components. It
might be worthwhile to contemplate what analysis under such assumptions actually
amounts to. A probabilistic framework relates to the following questions: What
would happen if I repeat the experiment? Should I then expect a very different
result? Will the limit of $\hat\theta_N$ depend on the particular realization of the random
variables? Even if the experiment is never repeated, it is clear that such questions
are relevant for the confidence one should develop for the estimate, and this makes
the analysis worthwhile. It is then another matter that the probabilistic framework
that is set up to answer such questions may exist only in the mind of the analyzer and
cannot be firmly tied to the real-world experiment.

It should also be remarked that a conventional stochastic description of disturbances
is not without problems: For example, suppose we measure a distance
with a crude measuring rod and describe the measurement error as a zero-mean
random variable, which is independent of the error obtained when the experiment
is repeated. This assumption implies, by the law of large numbers, that the distance
can be determined with arbitrary accuracy, if only the measurements are repeated
sufficiently many times. Clearly such a conclusion can be criticized from a practical
point of view. Results from theoretical analysis must thus be interpreted with care
when applied to a practical situation.


The question of how $\hat\theta_N$ behaves as $N$ increases clearly relates to the question of
how the corresponding criteria functions $V_N(\theta, Z^N)$ and $f_N(\theta, Z^N)$ behave. These
are, with a stochastic framework, sums of random variables, and their convergence
properties will be consequences of the law of large numbers. Our basic technical
tool in this chapter will thus be Theorem 2B.1. In order not to conceal the basic
ideas with too many technicalities, we shall only complete the proofs for linear, time-
invariant models (such as those in Chapter 4) and quadratic criteria. The techniques
and results, however, carry over also to more general cases.

The chapter is organized as follows. Assumptions about the infinite data set
$Z^\infty$ are given in Section 8.2. Convergence for prediction-error estimates is treated
in Section 8.3. Consistency questions (i.e., whether the true system is retrieved in the
limit) are discussed in Section 8.4. A frequency-domain characterization of the limit
estimate is given in Section 8.5. In Section 8.6, the corresponding results are given
for the correlation approach.


A Preview 


In the chapter a general and natural result is derived: the estimate $\hat\theta_N$ obtained by
the prediction-error method (7.155) will converge to the value that minimizes the
average criterion $\bar E\,\ell(\varepsilon(t,\theta), \theta)$. Here $\bar E$ can heuristically be taken as averaging
over time or ensembles (possible realizations) or both. The chapter deals both with
the formal framework for establishing this “obvious” result and with characterizations
of the limit value of $\hat\theta_N$. The reluctant reader of theory should concentrate on
understanding the main result, Equation (8.29), and the frequency-domain characterization
of the limit model in Section 8.5.
8.2 CONDITIONS ON THE DATA SET


The data set

$$Z^N = \{u(1), y(1), \ldots, u(N), y(N)\}$$

is the basic starting point. Analysis, we said, amounts to assuming certain properties
about the data and computing the resulting properties of $\hat\theta_N$. Since the analysis of
$\hat\theta_N$ will be carried out for $N \to \infty$, it is natural that the conditions on the data relate
to the infinite set $Z^\infty$. In this section we shall introduce such conditions, as well as
some pertinent definitions.


A Technical Condition D1 (*) 


We shall assume that the actual data are generated as depicted in Figure 8.1. The
input $u$ may be generated (partly) as output feedback or in open loop ($u = r$).
The signal $e_0$ represents the disturbances that act on the process. [The subscript 0
distinguishes this “true” noise $e_0$ from the “dummy” noise $e$ we have used in our
model descriptions (7.3).] The prime objective with condition D1 is to describe the
closed-loop system in Figure 8.1 as a stable system so that the dependence between
far apart data decays. The most restrictive condition is the assumed linearity (8.2). It
can be traded for more general conditions, at the price of more complicated analysis.
See Ljung (1978a), condition S3. For our analysis, we shall use the following technical
assumptions:


D1: The data set $Z^\infty$ is such that for some filters $\{d_t^{(i)}(k)\}$

$$y(t) = \sum_{k=1}^{\infty} d_t^{(1)}(k)\, r(t-k) + \sum_{k=0}^{\infty} d_t^{(2)}(k)\, e_0(t-k)$$
$$u(t) = \sum_{k=0}^{\infty} d_t^{(3)}(k)\, r(t-k) + \sum_{k=0}^{\infty} d_t^{(4)}(k)\, e_0(t-k) \tag{8.2}$$

where
1. $\{r(t)\}$ is a bounded, deterministic, external input sequence. (8.3)
2. $\{e_0(t)\}$ is a sequence of independent random variables with zero mean values
and bounded moments of order $4 + \delta$ for some $\delta > 0$. (8.4)
Moreover,
3. The family of filters $\{d_t^{(i)}(k)\}_{k=1}^{\infty}$, $i = 1$ to 4; $t = 1, 2, \ldots$ is uniformly
stable. (8.5)
4. The signals $\{y(t)\}$, $\{u(t)\}$ are jointly quasi-stationary. (8.6)
Figure 8.1 The data-generating configuration.


Recall the definitions of stability (2.29) and quasi-stationarity (2.58) to (2.62).
(Problem 2T.4 showed that uniform stability holds, even if the closed-loop system
goes through “unstable transients.”)


Remark. When we say that $\{r(t)\}$ is “deterministic,” we simply mean that
we regard it as a given sequence that (in contrast to $e_0$) can be reproduced if the
experiment is repeated. The stochastic operators and qualifiers, such as $E$, w.p. 1, and
AsN will thus average over the properties of $\{e_0(t)\}$ for the fixed sequence $\{r(t)\}$. Of
course, this does not exclude that this particular sequence $\{r(t)\}$ actually is generated
as a realization of a stochastic process, independent of the system disturbances. In
that case it is sometimes convenient to let the expectation also average over the
probabilistic properties of $\{r(t)\}$. We shall comment on how to do this below [Eq.
(8.27)].


A True System S


We shall sometimes use a more specific assumption of a “true system”:
S1: The data set $Z^\infty$ is generated according to

$$\mathcal S:\quad y(t) = G_0(q)u(t) + H_0(q)e_0(t) \tag{8.7}$$

where $\{e_0(t)\}$ is a sequence of independent random variables, with zero mean values,
variances $\lambda_0$, and bounded moments of order $4 + \delta$, some $\delta > 0$, and $H_0(q)$ is an
inversely stable, monic filter.


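We thus denote the true system by $\mathcal S$. Given a model structure (4.4),

$$\mathcal M:\quad \{G(q,\theta),\ H(q,\theta) \mid \theta \in D_{\mathcal M}\} \tag{8.8}$$

it is natural to check whether the true system (8.7) belongs to the set defined by (8.8).
We thus introduce

$$D_T(\mathcal S, \mathcal M) = \{\theta \in D_{\mathcal M} \mid G(e^{i\omega},\theta) = G_0(e^{i\omega});\ H(e^{i\omega},\theta) = H_0(e^{i\omega});\ -\pi \le \omega \le \pi\} \tag{8.9}$$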
This set is nonempty precisely when the model structure admits an exact description
of the true system. We write this also as

$$\mathcal{S} \in \mathcal{M} \tag{8.10}$$

Although such an assumption is not particularly realistic in practical applications, it
yields quite useful insight into properties of the estimated models.
When S1 holds, a more explicit version of conditions D1 can be given:


Lemma 8.1. Suppose that S1 holds, and the input is chosen as

$$u(t) = -F(q)y(t) + r(t)$$

such that there is a delay in either $G_0$ or $F$ and such that

$$[1 + G_0(q)F(q)]^{-1}G_0(q), \quad [1 + G_0(q)F(q)]^{-1}H_0(q),$$
$$F(q)[1 + G_0(q)F(q)]^{-1}G_0(q), \quad F(q)[1 + G_0(q)F(q)]^{-1}H_0(q)$$

are stable filters and that $\{r(t)\}$ is quasi-stationary. Then condition D1 holds.


Proof. We have, for the closed-loop system,

$$y(t) = [1 + G_0(q)F(q)]^{-1}G_0(q)r(t) + [1 + G_0(q)F(q)]^{-1}H_0(q)e_0(t) \tag{8.11}$$

and similarly for $u$. The stability condition means that the filters in (8.11) are stable.
Thus (8.6) follows from Theorem 2.2. Moreover, (8.2) and (8.5) are immediate from
(8.11) and the stability assumption. □


Information Content in the Data Set 


The set $Z^\infty$ is our source of information about the true system. This is to be fit to
a model structure $\mathcal{M}$ of our choice. (The reader might at this point review Section
4.5, if necessary.) The structure $\mathcal{M}$ describes a set of models $\mathcal{M}^*$ within which
the best one is sought. Identifiability of model structures concerns the question
whether different parameter vectors may describe the same model in the set $\mathcal{M}^*$.
See Definitions 4.6 to 4.8. A related question is whether the data set $Z^\infty$ allows us to
distinguish between different models in the set. Recall that, according to Definition
4.1, a (linear time-invariant) model is given by a filter $W(q)$. We shall call a data
set informative if it is capable of distinguishing between different models. We thus
introduce the following concept:


Definition 8.1. A quasi-stationary data set $Z^\infty$ is informative enough with respect
to the model set $\mathcal{M}^*$ if, for any two models $W_1(q)$ and $W_2(q)$ in the set,

$$\bar{E}\left[(W_1(q) - W_2(q))z(t)\right]^2 = 0 \tag{8.12a}$$

implies that $W_1(e^{i\omega}) \equiv W_2(e^{i\omega})$ for almost all $\omega$.
We note that with

$$W_1(q) - W_2(q) = \begin{bmatrix} \Delta W_u(q) & \Delta W_y(q) \end{bmatrix}$$

(8.12a) can be written

$$\bar{E}\left[\Delta W_u(q)u(t) + \Delta W_y(q)y(t)\right]^2 = 0 \tag{8.12b}$$

Note that the limit in (8.12) exists in view of (8.6) and Theorem 2.2. Recall also
(4.112) and the definition of equality of models, (4.116).


Definition 8.2. A quasi-stationary data set $Z^\infty$ is informative if it is informative
enough with respect to the model set $\mathcal{L}^*$, consisting of all linear, time-invariant
models.


The concept of informative data sets is very closely related to concepts of “persistently
exciting” inputs, “general enough” inputs, and so on. We shall discuss the
concept in detail in Chapter 13 in connection with experiment design. Here we give
an immediate consequence of Definition 8.2.


Theorem 8.1. A quasi-stationary data set $Z^\infty$ is informative if the spectrum matrix
for $z(t) = [u(t)\ \ y(t)]^T$ is strictly positive definite for almost all $\omega$.


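Proof. Consider (8.12) for arbitrary linear models $W_1$ and $W_2$. Let us denote
$W_1(q) - W_2(q) = \tilde{W}(q)$. Then applying Theorem 2.2 to (8.12) gives

$$0 = \int_{-\pi}^{\pi} \tilde{W}(e^{i\omega})\,\Phi_z(\omega)\,\tilde{W}^T(e^{-i\omega})\,d\omega$$

where

$$\Phi_z(\omega) = \begin{bmatrix} \Phi_u(\omega) & \Phi_{uy}(\omega) \\ \Phi_{yu}(\omega) & \Phi_y(\omega) \end{bmatrix} \tag{8.13}$$

Since $\Phi_z(\omega)$ is positive definite, this implies that $\tilde{W}(e^{i\omega}) \equiv 0$ almost everywhere,
which proves the theorem. □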


Some Additional Concepts and Notations (*)


In Definition 4.3 we defined a model structure as a differentiable mapping, such that
the predictors and their gradients were stable for each $\theta \in D_{\mathcal{M}}$. To facilitate the
analysis, we now strengthen this condition.


Definition 8.3. A model structure $\mathcal{M}$ is said to be uniformly stable if the family
of filters $\{W(q,\theta),\ \Psi(q,\theta)$ and $(d/d\theta)\Psi(q,\theta);\ \theta \in D_{\mathcal{M}}\}$ is uniformly stable and if
the set $D_{\mathcal{M}}$ is compact. [Recall the definition (2.29).]
Analogous to (4.109), we shall, when S1 holds, define

$$\chi_0(t) = \begin{bmatrix} u(t) \\ e_0(t) \end{bmatrix} \tag{8.14a}$$

and

$$T_0(q) = \begin{bmatrix} G_0(q) & H_0(q) \end{bmatrix} \tag{8.14b}$$

The system (8.7) can thus be written

$$y(t) = T_0(q)\chi_0(t)$$

The difference will be denoted

$$\tilde{T}(q,\theta) = T_0(q) - T(q,\theta) = \begin{bmatrix} \tilde{G}(q,\theta) & \tilde{H}(q,\theta) \end{bmatrix} \tag{8.15}$$


8.3 PREDICTION-ERROR APPROACH 


Basic Result 
The prediction-error estimate is defined by (7.12):

$$\hat{\theta}_N = \arg\min_{\theta \in D_{\mathcal{M}}} V_N(\theta, Z^N) \tag{8.16}$$

Determining the limit to which $\hat{\theta}_N$ converges as $N$ tends to infinity is obviously
related to the limit properties of the function $V_N(\theta, Z^N)$. For a quadratic criterion
and a linear, uniformly stable model structure $\mathcal{M}$, we have

$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N} \frac{1}{2}\varepsilon^2(t,\theta) \tag{8.17}$$

and, using (7.2),

$$\varepsilon(t,\theta) = [1 - W_y(q,\theta)]\,y(t) - W_u(q,\theta)\,u(t) \tag{8.18}$$

Under assumption D1 we can replace $y(t)$ and $u(t)$ in the preceding expression by
(8.2), which gives

$$\varepsilon(t,\theta) = \sum_{k=1}^{\infty} d_t^{(5)}(k;\theta)\,r(t-k) + \sum_{k=0}^{\infty} d_t^{(6)}(k;\theta)\,e_0(t-k) \tag{8.19}$$
Now the filters in (8.18) are uniformly (in $\theta$) stable since $\mathcal{M}$ is uniformly stable.
Under assumption (8.5) the filters in (8.2) are uniformly (in $t$) stable. Hence the
cascaded filters $\{d_t^{(i)}(k;\theta)\}$, $i = 5, 6$, in (8.19) are also uniformly (in both $\theta$ and $t$)
stable (see Problem 8D.2). That is,

$$|d_t^{(i)}(k;\theta)| \le \beta(k), \quad \forall t,\ \forall\theta \in D_{\mathcal{M}},\ i = 5, 6; \qquad \sum_{k=1}^{\infty}\beta(k) < \infty \tag{8.20}$$

Finally, under assumption (8.6), Theorem 2.2 implies that $\{\varepsilon(t,\theta)\}$ is quasi-stationary.
All conditions for Theorem 2B.1 are thus satisfied, and applying this theorem
to (8.17) with (8.19) gives the following result.


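Lemma 8.2. Consider a uniformly stable, linear model structure $\mathcal{M}$ (see Definitions
4.3 and 8.3). Assume that the data set $Z^\infty$ is subject to D1. Then, with
$V_N(\theta, Z^N)$ defined by (8.17),

$$\sup_{\theta \in D_{\mathcal{M}}} \left|V_N(\theta, Z^N) - \bar{V}(\theta)\right| \to 0, \quad \text{w.p. 1 as } N \to \infty \tag{8.21}$$

where

$$\bar{V}(\theta) = \bar{E}\,\tfrac{1}{2}\varepsilon^2(t,\theta) \tag{8.22}$$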


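The criterion function $V_N(\theta, Z^N)$ thus converges uniformly in $\theta \in D_{\mathcal{M}}$ to
the limit function $\bar{V}(\theta)$. This implies that the minimizing argument $\hat{\theta}_N$ of $V_N$ also
converges to the minimizing argument $\theta^*$ of $\bar{V}$ since $D_{\mathcal{M}}$ is compact. Notice that it
is essential that the convergence is indeed uniform in $\theta$ for this to hold (see Problem
8D.1). It may happen that $\bar{V}(\theta)$ does not have a unique global minimum. In that
case we define the set of minimizing values as

$$D_c = \arg\min_{\theta \in D_{\mathcal{M}}} \bar{V}(\theta) = \left\{\theta \;\middle|\; \theta \in D_{\mathcal{M}},\ \bar{V}(\theta) = \min_{\theta' \in D_{\mathcal{M}}} \bar{V}(\theta')\right\} \tag{8.23}$$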


We can thus formulate this corollary to Lemma 8.2 as our main convergence result:


Theorem 8.2. Let $\hat{\theta}_N$ be defined by (8.16) and (8.17), where $\varepsilon(t,\theta)$ is determined
from a uniformly stable linear model structure $\mathcal{M}$. Assume that the data set $Z^\infty$ is
subject to D1. Then

$$\hat{\theta}_N \to D_c, \quad \text{w.p. 1 as } N \to \infty \tag{8.24}$$

where $D_c$ is given by (8.22) and (8.23).


Remark. Convergence into a set as in (8.24) is to be interpreted as

$$\inf_{\bar{\theta} \in D_c} \left|\hat{\theta}_N - \bar{\theta}\right| \to 0, \quad \text{as } N \to \infty \tag{8.25}$$
The function $\bar{V}(\theta)$ will in general depend both on the true system and the input
properties. With a quadratic criterion and a linear model structure, it follows from
Theorem 2.2 that it depends on the data only via the spectrum matrix $\Phi_z(\omega)$ in
(8.13). [Explicit expressions will be given in (8.63) to (8.66).] This has the important
consequence that it is only the second-order properties of the data that affect the
convergence of the estimates.


Ensemble- and Time-averages 


The signal sources for $\varepsilon(t,\theta)$ are $r$ and $e_0$, as evidenced by (8.19). Recall that
$r = u$ in case of open loop operation. The symbol $\bar{E}$ denotes, as defined in (2.60),
ensemble-averaging (“statistical expectation”) over the stochastic process $\{e_0(t)\}$
and time-averaging over the deterministic signal $\{r(t)\}$. The function $\bar{V}(\theta)$ is thus
“the average value” of $\frac{1}{2}\varepsilon^2(t,\theta)$ in these two respects.

The reason for time-averaging over $\{r(t)\}$ is, as we have stated several times,
that it might not always be suitable to describe this signal as a realization of a stochastic
process. However, when indeed $\{r(t)\}$ is taken as a realization of a stationary
stochastic process (independent of $e_0$), Theorem 2.3 shows that, under weak conditions,
time averages over $\{r(t)\}$ will, with probability 1, equal the ensemble averages:

$$\lim_{N \to \infty} \frac{1}{N}\sum_{t=1}^{N} r(t)r(t-\tau) = E_r\, r(t)r(t-\tau) \quad \text{w.p. 1} \tag{8.26}$$

Here $E_r$ denotes statistical expectation with respect to the $r$-process.
This means that $e_0$-ensemble- and $r$-time-averaging by $\bar{E}$ will, w.p. 1, be equivalent
to taking total statistical expectation over both $e_0$ and $r$:

$$\text{“}\bar{E} = E \cdot E_r, \ \text{w.p. 1”} \tag{8.27a}$$

$$\bar{V}(\theta) = \bar{E}\,\tfrac{1}{2}\varepsilon^2(t,\theta) = E \cdot E_r\,\tfrac{1}{2}\varepsilon^2(t,\theta) \quad \text{w.p. 1} \tag{8.27b}$$

For “hand calculation” it is often easier to apply this total expectation: See
Examples 8.1 and 8.2.

[Conversely, one could also replace ensemble averages over $e_0$ by time averages
to eliminate the probabilistic framework entirely: See Problem 8T.1.]


The General Case 


With a little more technical effort, the results of Lemma 8.2 and Theorem 8.2 can
also be established for general norms $\ell(\varepsilon,\theta)$ as in (7.16), in which case the limit is
defined as

$$\bar{V}(\theta) = \bar{E}\,\ell(\varepsilon(t,\theta),\theta) \tag{8.28}$$

The result can also be extended to nonlinear, time-varying models and less restrictive
assumptions on the data set than D1. See, for example, Ljung (1978a) for such results.
In summary we thus have

$$\hat{\theta}_N \to \arg\min_{\theta \in D_{\mathcal{M}}} \bar{E}\,\ell(\varepsilon(t,\theta),\theta), \quad \text{w.p. 1 as } N \to \infty \tag{8.29}$$

This convergence result is quite general and intuitively appealing. It states that the
estimate will converge to the best possible approximation of the system that is available
in the model set. The goodness of the approximation is then measured in terms of
the criterion $\bar{V}(\theta)$ in (8.28). We shall dwell on what “best possible” actually means
in more practical terms in the next two sections. First we give two examples.


Example 8.1 Bias in ARX Structures 


Suppose that the system is given by

$$y(t) + a_0 y(t-1) = b_0 u(t-1) + e_0(t) + c_0 e_0(t-1) \tag{8.30}$$

where $\{u(t)\}$ and $\{e_0(t)\}$ are independent white noises with unit variances. Let the
model structure be given by

$$\hat{y}(t|\theta) + a y(t-1) = b u(t-1), \quad \theta = \begin{bmatrix} a \\ b \end{bmatrix} \tag{8.31}$$

The prediction-error variance is

$$\bar{V}(\theta) = \bar{E}[y(t) + a y(t-1) - b u(t-1)]^2 = r_0(1 + a^2 - 2aa_0) + b^2 - 2bb_0 + 2ac_0 \tag{8.32}$$

where

$$r_0 = \bar{E}y^2(t) = \frac{b_0^2 + c_0(c_0 - a_0) - a_0 c_0 + 1}{1 - a_0^2}$$

(see Problem 2E.7). It is easy to verify that the values of $a$ and $b$ that minimize (8.32)
are $\theta^* = [a^*\ \ b^*]^T$ given by

$$a^* = a_0 - \frac{c_0}{r_0}, \qquad b^* = b_0 \tag{8.33}$$

These values give a prediction-error variance

$$\bar{V}(\theta^*) = 1 + c_0^2 - \frac{c_0^2}{r_0} \tag{8.34}$$

This variance is smaller than what the “true values” $\theta_0 = [a_0\ \ b_0]^T$ inserted into (8.32)
would give:

$$\bar{V}(\theta_0) = 1 + c_0^2 \tag{8.35}$$
When we apply the prediction error method to (8.30) and (8.31), the estimates $\hat{a}_N$ and
$\hat{b}_N$ will converge, according to Theorem 8.2, to the values given by (8.33). The fact
that $a^* \ne a_0$ is usually expressed by saying that the estimate is “biased.” However, it is clear
from (8.34) and (8.35) that the bias is beneficial for the prediction performance of
the model (8.31). It gives a strictly better predictor for $a = a^*$ than for $a = a_0$. □
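
The limit values in (8.33) can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses Gaussian white noise (the example only requires unit-variance white noise), the function names are hypothetical, and the least-squares fit of the ARX model (8.31) is solved through its 2x2 normal equations.

```python
import random


def output_variance(a0, b0, c0):
    """r0 = E y^2(t) for the system (8.30) with unit-variance white u and e0."""
    return (b0**2 + c0 * (c0 - a0) - a0 * c0 + 1.0) / (1.0 - a0**2)


def arx_limit(a0, b0, c0):
    """Limit values (a*, b*) according to (8.33)."""
    return a0 - c0 / output_variance(a0, b0, c0), b0


def simulate_arx_ls(a0, b0, c0, n, seed=0):
    """Simulate (8.30) and fit the ARX model (8.31) by least squares.

    Assumption: u and e0 are drawn as independent Gaussian white noises.
    """
    rng = random.Random(seed)
    y_prev = u_prev = e_prev = 0.0
    s11 = s12 = s22 = t1 = t2 = 0.0
    for _ in range(n):
        e = rng.gauss(0.0, 1.0)
        y = -a0 * y_prev + b0 * u_prev + e + c0 * e_prev
        # Regressors chosen so that y(t) is approximated by a*p1 + b*p2.
        p1, p2 = -y_prev, u_prev
        s11 += p1 * p1
        s12 += p1 * p2
        s22 += p2 * p2
        t1 += p1 * y
        t2 += p2 * y
        y_prev, u_prev, e_prev = y, rng.gauss(0.0, 1.0), e
    det = s11 * s22 - s12 * s12
    a_hat = (t1 * s22 - t2 * s12) / det
    b_hat = (s11 * t2 - s12 * t1) / det
    return a_hat, b_hat


if __name__ == "__main__":
    a_hat, b_hat = simulate_arx_ls(0.5, 1.0, 0.7, 200_000, seed=1)
    print("estimate:", a_hat, b_hat)
    print("limit (8.33):", arx_limit(0.5, 1.0, 0.7))
```

For a long data record the estimate of $a$ should settle near $a^*$ rather than near $a_0$, which is the bias discussed in the example.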


Example 8.1 stresses that the algorithm indeed gives us the best possible predictor,
and it uses its parameters as vehicles for that. It is, however, important to
keep in mind that what is the best approximate description of a system in general
depends on the input used. We illustrate this by a simple example.


Example 8.2 Wrong Time Delay 


Consider the system

$$y(t) = b_0 u(t-1) + e_0(t) \tag{8.36}$$

where

$$u(t) = d_0 u(t-1) + w(t) \tag{8.37}$$

and where $\{e_0(t)\}$ and $\{w(t)\}$ are independent white-noise sequences with unit variances.
Let the model structure be given by

$$\hat{y}(t|\theta) = b\,u(t-2), \quad \theta = b \tag{8.38}$$

The prediction-error variance associated with (8.38) is

$$\bar{E}[y(t) - b u(t-2)]^2 = \bar{E}[b_0 u(t-1) - b u(t-2)]^2 + \bar{E}e_0^2(t)$$
$$= \bar{E}[(b_0 d_0 - b)u(t-2) + b_0 w(t-1)]^2 + 1$$

Hence

$$\hat{b}_N \to b_0 d_0, \quad \text{w.p. 1 as } N \to \infty$$

since this gives the smallest prediction-error variance. Now the predictor

$$\hat{y}(t|t-1) = b_0 d_0 u(t-2) \tag{8.39}$$

is a fairly reasonable one for the system (8.36) under the input (8.37). It yields the
prediction-error variance $1 + b_0^2$, compared to the optimal value 1 for a correct model
and the output variance

$$1 + \frac{b_0^2}{1 - d_0^2}$$

Notice, however, that the identified model is heavily dependent on the input that was
used during the identification experiment. If (8.39) is applied to a white-noise input
$\{u(t)\}$, the model (8.39) is useless: It yields the prediction-error variance $1 + b_0^2 + b_0^2 d_0^2$,
which is larger than the output variance $1 + b_0^2$. □
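
The variances quoted in Example 8.2 can be reproduced with a few closed-form expressions. The following is a minimal sketch, with hypothetical function names; the formulas follow from expanding the squares above, using that $u(t-2)$ and $w(t-1)$ are independent and that the input (8.37) has variance $1/(1 - d_0^2)$.

```python
def error_variance_ar_input(b, b0, d0):
    """E[y(t) - b u(t-2)]^2 when u is generated by (8.37)."""
    input_variance = 1.0 / (1.0 - d0**2)
    return (b0 * d0 - b) ** 2 * input_variance + b0**2 + 1.0


def error_variance_white_input(b, b0):
    """E[y(t) - b u(t-2)]^2 when u is unit-variance white noise."""
    return b0**2 + b**2 + 1.0


def output_variance_ar_input(b0, d0):
    """E y^2(t) under the input (8.37)."""
    return 1.0 + b0**2 / (1.0 - d0**2)


if __name__ == "__main__":
    b0, d0 = 1.0, 0.9
    # Crude grid search over b; the minimizer should be b0*d0.
    grid = [k / 100.0 for k in range(-200, 201)]
    b_best = min(grid, key=lambda b: error_variance_ar_input(b, b0, d0))
    print("limit estimate:", b_best)
    print("error variance, AR input:", error_variance_ar_input(b_best, b0, d0))
    print("output variance, AR input:", output_variance_ar_input(b0, d0))
    print("error variance, white input:", error_variance_white_input(b_best, b0))
    print("output variance, white input:", 1.0 + b0**2)
```

With the identification input the model removes most of the output variance, while with a white-noise input its prediction error exceeds the output variance, as stated in the example.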
8.4 CONSISTENCY AND IDENTIFIABILITY


Suppose now that assumption S1 holds so that we have a true system, denoted by $\mathcal{S}$.
Let us discuss under what conditions it will be possible to recover this system using
prediction-error identification.

Clearly, a first assumption must be that $\mathcal{S} \in \mathcal{M}$; that is, the set $D_T(\mathcal{S},\mathcal{M})$
defined by (8.9) is nonempty.

$\mathcal{S} \in \mathcal{M}$: Quadratic Criteria

The basic consistency result is almost immediate.

Theorem 8.3. Suppose that the data set $Z^\infty$ is subject to assumptions D1 and S1.
Let $\mathcal{M}$ be a linear, uniformly stable model structure such that $\mathcal{S} \in \mathcal{M}$. Assume
also that $Z^\infty$ is informative enough with respect to $\mathcal{M}$. If the input contains output
feedback, then also assume that there is a delay either in the regulator or in both
$G_0(q)$ and $G(q,\theta)$. Then

$$D_c = D_T(\mathcal{S},\mathcal{M}) \tag{8.40}$$

where $D_c$ is defined by (8.22) and (8.23) and $D_T(\mathcal{S},\mathcal{M})$ by (8.9). If, in addition, the
model structure is globally identifiable at $\theta_0 \in D_T(\mathcal{S},\mathcal{M})$, then

$$D_c = \{\theta_0\} \tag{8.41}$$


Theorems 8.2 and 8.3 together consequently state that the estimated transfer
functions obey

$$G(e^{i\omega},\hat{\theta}_N) \to G_0(e^{i\omega}); \quad H(e^{i\omega},\hat{\theta}_N) \to H_0(e^{i\omega}), \quad \text{w.p. 1 as } N \to \infty \tag{8.42}$$

Proof of Theorem 8.3. Let $\theta_0 \in D_T$ and consider, for any $\theta \in D_{\mathcal{M}}$,

$$\bar{V}(\theta) - \bar{V}(\theta_0) = \bar{E}[\varepsilon(t,\theta) - \varepsilon(t,\theta_0)]\,\varepsilon(t,\theta_0) + \tfrac{1}{2}\bar{E}[\varepsilon(t,\theta) - \varepsilon(t,\theta_0)]^2 \tag{8.43}$$

Since $\theta_0 \in D_T$,

$$\varepsilon(t,\theta_0) = -H_0^{-1}(q)G_0(q)u(t) + H_0^{-1}(q)y(t) = e_0(t)$$

according to S1. Moreover, the difference

$$\varepsilon(t,\theta) - \varepsilon(t,\theta_0) = \hat{y}(t|\theta_0) - \hat{y}(t|\theta)$$

depends only on input-output data up to time $t - 1$ and is therefore independent of
$e_0(t)$ [cf. (8.2)]. (If there is no delay in the system/model, the term $u(t)$ will appear
here. But then, according to the assumptions, there will be a delay in the regulator,
so that $u(t)$ and $e_0(t)$ are independent.) The first term of (8.43) is therefore zero.
The second term, which equals

$$\tfrac{1}{2}\bar{E}[\hat{y}(t|\theta_0) - \hat{y}(t|\theta)]^2$$

is strictly positive if $\theta$ and $\theta_0$ correspond to different models, since the data set is
sufficiently informative; see (8.12). Hence (8.40) follows from (8.23). The result
(8.41) follows since global identifiability of $\mathcal{M}$ at $\theta_0$ implies that $D_T = \{\theta_0\}$ [see
(4.135)]. □
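
The decomposition (8.43) is an elementary identity, spelled out here as a worked step. The shorthand $a = \varepsilon(t,\theta)$ and $b = \varepsilon(t,\theta_0)$ is introduced only for this note and is not notation from the text.

```latex
\begin{aligned}
\tfrac{1}{2}a^2 - \tfrac{1}{2}b^2
  &= \tfrac{1}{2}(a - b)(a + b) \\
  &= \tfrac{1}{2}(a - b)\bigl[(a - b) + 2b\bigr] \\
  &= (a - b)\,b + \tfrac{1}{2}(a - b)^2 .
\end{aligned}
```

Applying $\bar{E}$ to both sides and using $\bar{V}(\theta) = \bar{E}\,\tfrac{1}{2}\varepsilon^2(t,\theta)$ from (8.22) gives (8.43).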


$G_0 \in \mathcal{G}$: Quadratic Criteria


Often it is more important to have a good estimate of the transfer function $G$ than of
the noise filter $H$. We shall now study the situation where the set of model transfer
functions

$$\mathcal{G} = \{G(e^{i\omega},\theta) \mid \theta \in D_{\mathcal{M}}\}$$

is large enough to contain the true transfer function,

$$G_0 \in \mathcal{G} \tag{8.44}$$

but the true noise description $H_0$ cannot be exactly described within the model set.
Hence $\mathcal{S} \notin \mathcal{M}$. We then have the following result:


Theorem 8.4. Suppose that the data set $Z^\infty$ is subject to assumptions D1 and
S1. Let $\mathcal{M}$ be a linear, uniformly stable model structure, such that $G$ and $H$ are
independently parametrized:

$$\theta = \begin{bmatrix} \rho \\ \eta \end{bmatrix}, \quad G(q,\theta) = G(q,\rho), \quad H(q,\theta) = H(q,\eta) \tag{8.45}$$

and such that the set

$$D_G(\mathcal{S},\mathcal{M}) = \{\rho \mid G(e^{i\omega},\rho) = G_0(e^{i\omega})\ \ \forall\omega\} \tag{8.46}$$

is nonempty. Assume that $Z^\infty$ is informative enough with respect to $\mathcal{M}$ and that the
system operates in open loop; that is,

$$\{u(t)\} \text{ and } \{e_0(t)\} \text{ are independent} \tag{8.47}$$

Let

$$\hat{\theta}_N = \begin{bmatrix} \hat{\rho}_N \\ \hat{\eta}_N \end{bmatrix}$$

be obtained by the prediction-error method (8.16) and (8.17). Then

$$\hat{\rho}_N \to D_G(\mathcal{S},\mathcal{M}) \quad \text{w.p. 1 as } N \to \infty \tag{8.48}$$

The result (8.48) can be written more suggestively as

$$G(e^{i\omega},\hat{\theta}_N) \to G_0(e^{i\omega}), \quad \text{w.p. 1 as } N \to \infty \tag{8.49}$$
Proof. Consider the function $\bar{V}(\theta)$ given by (8.22). We have from S1

$$\varepsilon(t,\theta) = H^{-1}(q,\eta)\,[y(t) - G(q,\rho)u(t)]$$
$$= H^{-1}(q,\eta)\,[(G_0(q) - G(q,\rho))u(t) + H_0(q)e_0(t)]$$
$$= u_F(t,\eta,\rho) + e_F(t,\eta)$$

with obvious notation. Since $u$ and $e_0$ are independent, we have that

$$\bar{V}(\theta) = \bar{V}(\rho,\eta) = \tfrac{1}{2}\left[\bar{E}u_F^2(t,\rho,\eta) + \bar{E}e_F^2(t,\eta)\right]$$

The first term is zero precisely when $\rho \in D_G(\mathcal{S},\mathcal{M})$, and the second term is
independent of $\rho$. Hence

$$\arg\min_{\rho} \bar{V}(\rho,\eta) = D_G(\mathcal{S},\mathcal{M})$$

irrespective of $\eta$, which, together with Theorem 8.2, concludes the proof. □


We may add that both assumptions (8.45) and (8.47) are essential for the result
to hold. See Example 8.1 and Problem 8E.3.


The case of independent parametrization (8.45) covers the output error model
(4.25) along with variants with fixed noise models

$$y(t) = G(q,\theta)u(t) + H_*(q)e(t) \tag{8.50}$$

[which alternatively can be regarded as the output error model used with a prefilter
$L(q) = 1/H_*(q)$; see (7.13) and (7.14)]. It also covers the Box-Jenkins model
structure (4.31). These model structures consequently have the important advantage
that the transfer function $G$ can be consistently estimated, even when the noise model
set is too simple to admit a completely correct description of the system.


Example 8.3 First Order Output Error Model 


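Consider the system (8.30) of Example 8.1, and let the model structure be a first-order
output error model:

$$\hat{y}(t|\theta) = \frac{b q^{-1}}{1 + a q^{-1}}\,u(t), \quad \theta = \begin{bmatrix} a \\ b \end{bmatrix}$$

In this case it follows from Theorem 8.4 that the estimates $\hat{a}_N$ and $\hat{b}_N$ will converge
to the true values $a_0$ and $b_0$. □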


$\mathcal{S} \in \mathcal{M}$: General Norm $\ell(\varepsilon)$ (*)


With a general, $\theta$-independent norm $\ell(\varepsilon)$, the estimate converges into the set $D_c$:

$$\hat{\theta}_N \to D_c = \arg\min_{\theta \in D_{\mathcal{M}}} \bar{E}\,\ell(\varepsilon(t,\theta)) \tag{8.51}$$

according to (8.29). In general, the set $D_c$ will depend on $\ell$. However, when $\mathcal{S} \in \mathcal{M}$
it is desirable that $D_c = D_T(\mathcal{S},\mathcal{M})$ for all reasonable choices of $\ell$. Clearly, some
conditions must be imposed on $\ell$, and Problem 8D.3 shows that it is not sufficient
to require $\ell(\varepsilon)$ to be increasing with $|\varepsilon|$. We have to require $\ell(\varepsilon)$ to be convex in
order to prove a result that holds for all distributions of the innovation $e_0(t)$. We
thus have the following extension of Theorem 8.3.


Theorem 8.5. Let $\ell(x)$ be a twice differentiable function such that

$$E\,\ell'(e_0(t)) = 0 \tag{8.52}$$

$$\ell''(x) \ge \delta > 0, \quad \forall x \tag{8.53}$$

Here $e_0(t)$ are the innovations in assumption S1. Then, under the assumptions of
Theorem 8.3,

$$D_c = D_T(\mathcal{S},\mathcal{M})$$

with $D_c$ defined by (8.28) and (8.23).
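Proof. Let $\theta_0 \in D_T$ and denote as usual

$$\varepsilon(t,\theta_0) = e_0(t)$$

Then for any $\theta \notin D_T$

$$\varepsilon(t,\theta) = e_0(t) + \tilde{y}(t,\theta)$$

where $\bar{E}[\tilde{y}(t,\theta)]^2 > 0$ since the data set is sufficiently informative. Hence, by
Taylor's expansion,

$$\ell(\varepsilon(t,\theta)) = \ell(e_0(t)) + \tilde{y}(t,\theta)\,\ell'(e_0(t)) + \tfrac{1}{2}[\tilde{y}(t,\theta)]^2\,\ell''(\xi(t))$$

where $\xi(t)$ is a value between $e_0(t)$ and $\varepsilon(t,\theta)$. Since $e_0(t)$ and $\tilde{y}(t,\theta)$ are
independent, this expression gives

$$\bar{E}\,\ell(\varepsilon(t,\theta)) = \bar{E}\,\ell(e_0(t)) + 0 + \tfrac{1}{2}\bar{E}\left\{[\tilde{y}(t,\theta)]^2\,\ell''(\xi(t))\right\}$$
$$\ge \bar{E}\,\ell(e_0(t)) + \frac{\delta}{2}\bar{E}[\tilde{y}(t,\theta)]^2 > \bar{E}\,\ell(e_0(t))$$

using (8.52) and (8.53) and $\bar{E}[\tilde{y}(t,\theta)]^2 > 0$, respectively. This concludes the proof. □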


Clearly, an analogous extension of Theorem 8.4 can also be given.
In the maximum likelihood method the norm $\ell$ is chosen as the negative
logarithm of the PDF of the innovations, (7.85):

$$\ell(x) = -\log f_e(x) \tag{8.54}$$

It can be shown that (8.52) automatically holds for this norm, and that Theorem 8.5
holds without condition (8.53). See Problem 8G.3.
$\mathcal{S} \in \mathcal{M}$: General Norm $\ell(\varepsilon,\alpha)$ (*)

We consider now the case where the norm is parametrized by an $\alpha$ that is independent
of the parametrization of the predictor as in (7.17). We thus have that the limit values
of $\theta$ and $\alpha$ are given by

$$(\theta^*,\alpha^*) = \arg\min_{\theta,\alpha} \bar{V}(\theta,\alpha) = \arg\min_{\theta,\alpha} \bar{E}\,\ell(\varepsilon(t,\theta),\alpha) \tag{8.55}$$

If $\mathcal{S} \in \mathcal{M}$ and the conditions of Theorem 8.5 are satisfied for all $\alpha$, then it is clear
that $\theta^* \in D_T(\mathcal{S},\mathcal{M})$, regardless of $\alpha$. This means that

$$\alpha^* = \arg\min_{\alpha} \bar{E}\,\ell(e_0(t),\alpha) \tag{8.56}$$

We shall study what (8.56) tells us about the limit value $\alpha^*$. We first have the following
result.

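Lemma 8.3. Consider a norm (7.17), normalized so that

$$\int_{-\infty}^{\infty} e^{-\ell(x,\alpha)}\,dx = 1 \quad \forall\alpha \tag{8.57}$$

Let the PDF of $e_0(t)$ be $f_e(x)$, and assume that for some $\alpha_0$

$$\ell(x,\alpha_0) = -\log f_e(x) \tag{8.58}$$

Then $\alpha^* = \alpha_0$ in (8.56).


Proof. Let

$$f_\alpha(x) = e^{-\ell(x,\alpha)}$$

Hence

$$E\,\ell(e_0(t),\alpha) - E\,\ell(e_0(t),\alpha_0) = -E\log\frac{f_\alpha(e_0(t))}{f_e(e_0(t))}$$
$$\ge -\log E\,\frac{f_\alpha(e_0(t))}{f_e(e_0(t))} = -\log\int\left[\frac{f_\alpha(x)}{f_e(x)}\right]f_e(x)\,dx = -\log\int f_\alpha(x)\,dx = 0$$

The inequality is Jensen's inequality (see Chung, 1974) since $-\log x$ is a convex
function, and equality holds if and only if $f_\alpha(x) = \text{const} \cdot f_e(x)$. This proves the
lemma. □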


Heuristically, we could thus say that 


the minimization with respect to $\alpha$ in (8.55) tries to make the norm $\ell(\varepsilon,\alpha)$
look like the negative logarithm of the PDF of the true innovations. (8.59)
Example 8.4 Estimating the Innovations Variance


Let $\ell(\varepsilon,\alpha)$ be given by (7.87):

$$\ell(\varepsilon,\alpha) = \frac{1}{2}\left[\frac{\varepsilon^2}{\alpha} + \log\alpha\right]$$

We find that

$$\bar{E}\left[\ell(e_0(t),\alpha)\right] = \frac{1}{2}\left[\frac{E e_0^2(t)}{\alpha} + \log\alpha\right] = \frac{1}{2}\left[\frac{\lambda_0}{\alpha} + \log\alpha\right]$$

which is minimized by $\alpha = \lambda_0$. The estimate $\hat{\alpha}_N$ will thus converge to the innovation
variance as $N$ tends to infinity. See also Problem 7E.7. □
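
The claim that $\alpha = \lambda_0$ minimizes the expected norm can be checked with a simple grid search. This is only an illustrative sketch; the function names and the grid are choices made here, not part of the text.

```python
import math


def expected_norm(alpha, lam0):
    """E l(e0(t), alpha) = 0.5 * (lam0/alpha + log(alpha)) from Example 8.4."""
    return 0.5 * (lam0 / alpha + math.log(alpha))


def minimize_on_grid(lam0, step=0.01, upper=10.0):
    """Return the grid point in (0, upper] with the smallest expected norm."""
    count = int(round(upper / step))
    grid = [step * k for k in range(1, count + 1)]
    return min(grid, key=lambda alpha: expected_norm(alpha, lam0))


if __name__ == "__main__":
    lam0 = 2.5
    print("minimizing alpha on the grid:", minimize_on_grid(lam0))
```

Setting the derivative $\tfrac{1}{2}(-\lambda_0/\alpha^2 + 1/\alpha)$ to zero gives the same answer analytically.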


When the parametrizations of the predictor and of the norm $\ell(\varepsilon,\theta)$ have common
parameters, the conclusion is that

$$\theta^* = \arg\min_{\theta \in D_{\mathcal{M}}} \bar{E}\,\ell(\varepsilon(t,\theta),\theta)$$

will give a compromise between making the prediction errors $\{\varepsilon(t,\theta)\}$ equal to the
true innovations $\{e_0(t)\}$ and (8.59), that is, making the norm look like $-\log f_e(x)$.
In case these two objectives cannot be reached simultaneously, consistency may be
lost even if $D_T(\mathcal{S},\mathcal{M})$ is nonempty. See Problem 8E.2.


Multivariable Case (*)


The convergence and consistency results for multivariable systems are entirely analogous
to the scalar case. The result (8.29) holds without notational changes for the
multivariable case. The counterparts of Theorems 8.3 and 8.4 with quadratic criteria

$$\bar{E}\,\ell(\varepsilon(t,\theta)) = \tfrac{1}{2}\bar{E}\,\varepsilon^T(t,\theta)\Lambda^{-1}\varepsilon(t,\theta) \tag{8.60}$$

hold as stated, with only obvious notational changes in the proofs. For Theorem 8.5,
the condition (8.53) takes the form that the $p \times p$ matrix $\ell''(\varepsilon)$ should be positive
definite.


8.5 LINEAR TIME-INVARIANT MODELS: A FREQUENCY-DOMAIN
DESCRIPTION OF THE LIMIT MODEL


Theorem 8.2 describes the limiting estimate $\theta^*$, $\theta^* \in D_c$, as the one that minimizes
the prediction error variance among all models in the structure $\mathcal{M}$. In case $\mathcal{S} \in \mathcal{M}$,
this means that $\theta^* = \theta_0$ is a true description of the system (see Theorem 8.3), but
otherwise the model will differ from the true system. In this section we shall develop
some expressions that characterize this misfit between the limiting model and the
true system for the case of linear time-invariant models. See also Problem 8G.4.
A Note on Data from Arbitrary Systems (*)


Even though the approximation expressions below take their starting point in a “true
linear system” according to assumption S1, it is of interest to note that they are equally
applicable to data from arbitrary, nonlinear, time-varying systems. Suppose that the
input and output signals are jointly quasi-stationary, so that their spectra are defined.
Then we can determine the optimal predictor (Wiener filter) for predicting $y(t)$ from
past data:

$$\hat{y}(t|t-1) = W_u(q)u(t) + W_y(q)y(t) \tag{8.61}$$

where $W_u$ and $W_y$ are computed from the (cross) spectra $\Phi_u$, $\Phi_{yu}$, and $\Phi_y$ in a well
defined way. (See Wiener (1949) and also Problem 8G.5.) The error $e_0(t) = y(t) -
\hat{y}(t|t-1)$ will by construction be uncorrelated with past inputs and outputs. This
also means that $e_0(k)$, $k < t$, will be uncorrelated with $e_0(t)$, since it is constructed
from past input-output data. If we introduce $H_0(q) = (1 - W_y(q))^{-1}$ and $G_0(q) =
H_0(q)W_u(q)$ (cf. (4.114)), we can rewrite (8.61) as

$$y(t) = G_0(q)u(t) + H_0(q)e_0(t) \tag{8.62}$$

where $e_0(t)$ is an uncorrelated sequence, and uncorrelated also with past input-output
data. In general, independence will not hold, but in the calculations to follow we
will only utilize the second-order properties of the data. All this means that (8.62)
is a correct description of the observed data, if only their second-order properties
are considered: “the best linear time-invariant approximation of the true system.” (It
may still be useless for control and decision-making, though, since this approximation
depends on the actual data spectra.)


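An Expression for $\bar{V}(\theta)$


By the fundamental expression (2.65), we may write

$$\bar{V}(\theta) = \bar{E}\,\tfrac{1}{2}\varepsilon^2(t,\theta) = \frac{1}{4\pi}\int_{-\pi}^{\pi}\Phi_\varepsilon(\omega,\theta)\,d\omega \tag{8.63}$$

where $\Phi_\varepsilon(\omega,\theta)$ is the spectrum of the prediction errors $\{\varepsilon(t,\theta)\}$. Under assumption
S1 we have

$$y(t) = G_0(q)u(t) + H_0(q)e_0(t) \tag{8.64}$$

where the noise source $e_0$ has variance $\lambda_0$. Then, for a linear model structure, we
obtain the prediction errors

$$\varepsilon(t,\theta) = H^{-1}(q,\theta)[y(t) - G(q,\theta)u(t)] = H_\theta^{-1}[(G_0 - G_\theta)u(t) + H_0 e_0(t)]$$
$$= H_\theta^{-1}[(G_0 - G_\theta)u(t) + (H_0 - H_\theta)e_0(t)] + e_0(t)$$
$$= H_\theta^{-1}\begin{bmatrix}(G_0 - G_\theta) & (H_0 - H_\theta)\end{bmatrix}\begin{bmatrix}u(t) \\ e_0(t)\end{bmatrix} + e_0(t) \tag{8.65}$$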
Here, we introduced the shorter notation $H_\theta = H(q,\theta)$ and similarly for $G_\theta$.
Assume now, as in Theorem 8.3, that the system may be under feedback control,
but is such that there is a delay either in the system and the model (i.e., $G_0$ and
$G_\theta$ both contain a delay), or in the regulator (so that $u(t)$ may depend only on
$y(t-1)$ and earlier values). Note also that, since both $H_0$ and $H_\theta$ are monic, the
term $(H_0 - H_\theta)e_0(t)$ is independent of $e_0(t)$. All this means that the innovation
$e_0(t)$ will be uncorrelated with the first terms in (8.65). Then, according to (2.95),
the spectrum of $\varepsilon$ can be written

$$\Phi_\varepsilon(\omega,\theta) = \frac{1}{|H_\theta|^2}\begin{bmatrix}(G_0 - G_\theta) & (H_0 - H_\theta)\end{bmatrix}\begin{bmatrix}\Phi_u & \Phi_{ue} \\ \Phi_{eu} & \lambda_0\end{bmatrix}\begin{bmatrix}\overline{(G_0 - G_\theta)} \\ \overline{(H_0 - H_\theta)}\end{bmatrix} + \lambda_0 \tag{8.66}$$

Overbar here means complex conjugation. $\Phi_{ue} = \overline{\Phi}_{eu}$ is the cross spectrum between
$e_0$ and $u$, which of course is zero if the system operates in open loop. Note that the
data spectrum can be factorized as

$$\begin{bmatrix}\Phi_u & \Phi_{ue} \\ \Phi_{eu} & \lambda_0\end{bmatrix} = \begin{bmatrix}1 & 0 \\ \Phi_{eu}/\Phi_u & 1\end{bmatrix}\begin{bmatrix}\Phi_u & 0 \\ 0 & \lambda_0 - |\Phi_{ue}|^2/\Phi_u\end{bmatrix}\begin{bmatrix}1 & \Phi_{ue}/\Phi_u \\ 0 & 1\end{bmatrix} \tag{8.67}$$

Let us introduce

$$B(e^{i\omega},\theta) = \frac{\left(H_0(e^{i\omega}) - H(e^{i\omega},\theta)\right)\Phi_{ue}(\omega)}{\Phi_u(\omega)} \tag{8.68}$$

Then, using (8.67), (8.66) can be rewritten

$$\Phi_\varepsilon(\omega,\theta) = \frac{|G_0 + B_\theta - G_\theta|^2\,\Phi_u}{|H_\theta|^2} + \frac{|H_0 - H_\theta|^2\left(\lambda_0 - |\Phi_{ue}|^2/\Phi_u\right)}{|H_\theta|^2} + \lambda_0 \tag{8.69}$$

We now have a characterization of

$$D_c = \arg\min_{\theta} \bar{V}(\theta) \tag{8.70}$$

in the frequency domain. We see that if there is a parameter value $\theta_0$ such that
$G_{\theta_0} = G_0$ and $H_{\theta_0} = H_0$, then this value will minimize the integral of (8.69), since
the first two terms then vanish. This we knew before: it is a restatement of Theorem
8.3. We have, however, also obtained a more explicit picture of how the model is
approximating the true system in case it cannot be exactly described. We see that
$G_\theta$ will be pulled towards $G_0 + B_\theta$ and $H_\theta$ towards $H_0$ with the indicated weightings. To
be more specific, let us investigate some special cases.
Open Loop Case


If the system operates in open loop, $u$ and $e_0$ will be independent, so $\Phi_{ue} \equiv 0$, and
hence $B_\theta \equiv 0$. If the noise model is fixed to $H(q,\theta) = H_*(q)$, (8.70) and (8.69)
specialize to

$$D_c = \arg\min_{\theta}\int_{-\pi}^{\pi}\left|G_0(e^{i\omega}) - G(e^{i\omega},\theta)\right|^2 Q_*(\omega)\,d\omega \tag{8.71a}$$

$$Q_*(\omega) = \frac{\Phi_u(\omega)}{|H_*(e^{i\omega})|^2} \tag{8.71b}$$

where we disposed of the $\theta$-independent terms. Let $\theta^* \in D_c$. In this case the limiting
model $G(e^{i\omega},\theta^*)$ is a clear-cut best mean-square approximation of $G_0(e^{i\omega})$, with a
frequency weighting $Q_*$. This depends on the noise model $H_*$ and the input spectrum
$\Phi_u$, and can be interpreted as the model signal-to-noise ratio.
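
The weighted criterion in (8.71a) is easy to evaluate numerically for given transfer functions. The sketch below is an illustration only: the function `weighted_misfit`, the midpoint-rule integration, and the first-order test system are assumptions made here, not taken from the text.

```python
import cmath
import math


def weighted_misfit(g0, g, q, n=1024):
    """Midpoint-rule approximation of the integral in (8.71a).

    g0 and g map z = exp(i*w) to a complex gain; q maps w to a real weight.
    """
    dw = 2.0 * math.pi / n
    total = 0.0
    for k in range(n):
        w = -math.pi + (k + 0.5) * dw
        z = cmath.exp(1j * w)
        total += abs(g0(z) - g(z)) ** 2 * q(w)
    return total * dw


def first_order(b, a):
    """Transfer function b q^-1 / (1 + a q^-1), evaluated at z."""
    return lambda z: (b / z) / (1.0 + a / z)


if __name__ == "__main__":
    g0 = first_order(1.0, 0.5)            # hypothetical "true" system
    flat = lambda w: 1.0                  # Q_* = 1: white input, H_* = 1
    # Candidate models G(q, b) = b q^-1 (a one-tap FIR model, too simple for g0).
    grid = [0.05 * k for k in range(10, 31)]
    best = min(grid, key=lambda b: weighted_misfit(g0, lambda z: b / z, flat))
    print("best one-tap gain under flat weighting:", best)
```

With a flat weighting the best one-tap model matches the first impulse-response coefficient of the true system; a different $Q_*$ would in general move the minimizer.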


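Consider now an independently parameterized noise model with $\theta = [\rho\ \ \eta]^T$
as in (8.45), (4.128). We can define the spectrum of the output error

$$(G_0(q) - G(q,\rho))u(t) + H_0(q)e_0(t)$$

for each value $\rho$ as

$$\Phi_{ER}(\omega,\rho) = \left|G_0(e^{i\omega}) - G(e^{i\omega},\rho)\right|^2\Phi_u(\omega) + \lambda_0\left|H_0(e^{i\omega})\right|^2 = \lambda_\rho\left|R(e^{i\omega},\rho)\right|^2 \tag{8.72}$$

where the last equation is a definition of the spectral factor $R$. This is a monic
function; see the spectral factorization result in Section 2.3. Then the integral of
(8.69) takes the form

$$\int_{-\pi}^{\pi}\left[\frac{|G_0 - G_\rho|^2\Phi_u}{|H_\eta|^2} + \lambda_0\frac{|H_0 - H_\eta|^2}{|H_\eta|^2} + \lambda_0\right]d\omega = \int_{-\pi}^{\pi}\frac{\Phi_{ER}(\omega,\rho)}{|H(e^{i\omega},\eta)|^2}\,d\omega$$
$$= \lambda_\rho\int_{-\pi}^{\pi}\left|\frac{R(e^{i\omega},\rho)}{H(e^{i\omega},\eta)}\right|^2 d\omega = \lambda_\rho\int_{-\pi}^{\pi}\left|\frac{R(e^{i\omega},\rho)}{H(e^{i\omega},\eta)} - 1\right|^2 d\omega + 2\pi\lambda_\rho$$
$$= \int_{-\pi}^{\pi}\left|\frac{1}{H(e^{i\omega},\eta)} - \frac{1}{R(e^{i\omega},\rho)}\right|^2\Phi_{ER}(\omega,\rho)\,d\omega + 2\pi\lambda_\rho$$

where we twice used the identity $\int |R|^2\,d\omega = \int (|R - 1|^2 + 1)\,d\omega$ for a monic
function $R$.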
Now, we can characterize the parameters that minimize the criterion function.
Let $\theta^* \in D_c$. Then we can write $\theta^* = [\rho^*\ \ \eta^*]^T$, with

$$\rho^* = \arg\min_{\rho}\int_{-\pi}^{\pi}\left|G_0(e^{i\omega}) - G(e^{i\omega},\rho)\right|^2 Q(\omega,\eta^*)\,d\omega \tag{8.73a}$$

$$Q(\omega,\eta^*) = \frac{\Phi_u(\omega)}{|H(e^{i\omega},\eta^*)|^2} \tag{8.73b}$$

$$\eta^* = \arg\min_{\eta}\int_{-\pi}^{\pi}\left|\frac{1}{H(e^{i\omega},\eta)} - \frac{1}{R(e^{i\omega},\rho^*)}\right|^2\Phi_{ER}(\omega,\rho^*)\,d\omega \tag{8.73c}$$

Here $\Phi_{ER}$ and $R$ are defined by (8.72). In this case we see that $G(e^{i\omega},\rho^*)$ is
fitted to $G_0(e^{i\omega})$ in the $Q(\omega,\eta^*)$ norm. This norm is not known a priori, but will
be known after the minimization, once $\eta^*$ is computed. We also see that the noise
model $H(e^{i\omega},\eta)$ is fitted to describe the spectrum of the output error.

In the general case, where $G$ and $H$ share parameters, no clear-cut formal
characterization of the fit of $G$ can be given. It is useful, though, and intuitively
appealing to see the resulting estimate $\theta^*$ as a compromise between fitting $G_\theta$ to $G_0$
in the frequency norm $\Phi_u/|H_{\theta^*}|^2$ and fitting the model spectrum $|H_\theta|^2$ to the output
error spectrum $\Phi_{ER}(\omega,\theta^*)$.


Closed Loop Case 


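Let us assume that the regulator is linear, so that $u$ will be a linear function of the
reference signal $r$ and the noise source $e_0$ as in Assumption D1. We can then split up
the input spectrum $\Phi_u$ into that part that originates from $r$ and that part that comes
from $e_0$:

$$\Phi_u(\omega) = \Phi_u^r(\omega) + \Phi_u^e(\omega) \tag{8.74}$$

If the linear filters that define the input are time-invariant, $u(t) = K_1(q)r(t) +
K_2(q)e_0(t)$, we find that $\Phi_{ue}(\omega) = \lambda_0 K_2(e^{i\omega})$ and hence

$$|\Phi_{ue}(\omega)|^2 = \lambda_0\Phi_u^e(\omega) \tag{8.75}$$

This means that we can characterize the size of the “bias-pull” term $B$ in (8.68) more
explicitly as

$$\left|B(e^{i\omega},\theta)\right|^2 = \frac{\lambda_0}{\Phi_u(\omega)}\cdot\frac{\Phi_u^e(\omega)}{\Phi_u(\omega)}\cdot\left|H_0(e^{i\omega}) - H(e^{i\omega},\theta)\right|^2 \tag{8.76}$$

and the resulting parameter $\theta^*$ will be the value that minimizes

$$\int_{-\pi}^{\pi}\left|G_0(e^{i\omega}) + B(e^{i\omega},\theta) - G(e^{i\omega},\theta)\right|^2\frac{\Phi_u(\omega)}{|H(e^{i\omega},\theta)|^2}\,d\omega + \lambda_0\int_{-\pi}^{\pi}\frac{\left|H_0(e^{i\omega}) - H(e^{i\omega},\theta)\right|^2}{|H(e^{i\omega},\theta)|^2}\cdot\frac{\Phi_u^r(\omega)}{\Phi_u(\omega)}\,d\omega \tag{8.77}$$

We shall investigate this expression further in Section 13.4.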
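
The step from (8.69) to (8.77) relies on the identity $\lambda_0 - |\Phi_{ue}|^2/\Phi_u = \lambda_0\Phi_u^r/\Phi_u$, which follows from (8.74) and (8.75). A minimal numerical check is sketched below; the function names, the filters $K_1$, $K_2$ used in the demonstration, and the assumption of a white reference with constant spectrum `phi_r` are all hypothetical choices made for illustration.

```python
import cmath


def closed_loop_terms(k1, k2, phi_r, lam0, w):
    """Return (Phi_u^r, Phi_u^e, Phi_ue) at frequency w for u = K1 r + K2 e0."""
    z = cmath.exp(1j * w)
    phi_u_r = abs(k1(z)) ** 2 * phi_r   # part of the input spectrum due to r
    phi_u_e = lam0 * abs(k2(z)) ** 2    # part due to the noise source e0
    phi_ue = lam0 * k2(z)               # cross spectrum, as stated before (8.75)
    return phi_u_r, phi_u_e, phi_ue


def noise_weight_direct(k1, k2, phi_r, lam0, w):
    """lam0 - |Phi_ue|^2 / Phi_u, the factor on |H0 - H|^2 in (8.69)."""
    phi_u_r, phi_u_e, phi_ue = closed_loop_terms(k1, k2, phi_r, lam0, w)
    return lam0 - abs(phi_ue) ** 2 / (phi_u_r + phi_u_e)


def noise_weight_split(k1, k2, phi_r, lam0, w):
    """lam0 * Phi_u^r / Phi_u, the same factor as it appears in (8.77)."""
    phi_u_r, phi_u_e, _ = closed_loop_terms(k1, k2, phi_r, lam0, w)
    return lam0 * phi_u_r / (phi_u_r + phi_u_e)


if __name__ == "__main__":
    k1 = lambda z: 1.0 / (1.0 + 0.5 / z)            # hypothetical K1(q)
    k2 = lambda z: (-0.3 / z) / (1.0 + 0.5 / z)     # hypothetical K2(q)
    for w in (0.1, 0.7, 2.5):
        print(w, noise_weight_direct(k1, k2, 1.0, 2.0, w),
              noise_weight_split(k1, k2, 1.0, 2.0, w))
```

The two columns agree at every frequency, and the weight shrinks towards zero as the reference contribution $\Phi_u^r$ becomes small relative to $\Phi_u$.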
Example 8.5 Approximation in the Frequency Domain


Consider the system

$$y(t) = G_0(q)u(t)$$

with

$$G_0(q) = \frac{0.001q^{-2}\left(10 + 7.4q^{-1} + 0.924q^{-2} + 0.1764q^{-3}\right)}{1 - 2.14q^{-1} + 1.553q^{-2} - 0.4387q^{-3} + 0.042q^{-4}} \tag{8.78}$$

No disturbances act on the system. The input is a PRBS (see Section 13.3) with basic
period one sample, which gives $\Phi_u(\omega) \approx 1$ for all $\omega$.

This system was identified with the prediction-error method using a quadratic
criterion and prefilter $L(q) \equiv 1$ in the output error model structure

$$\hat{y}(t|\theta) = \frac{b_1 q^{-1} + b_2 q^{-2}}{1 + f_1 q^{-1} + f_2 q^{-2}}\,u(t) \tag{8.79}$$

Bode plots of the true system and of the resulting model are given in Figure 8.2a.
We see that the model gives a good description of the low-frequency properties but
is bad at high frequencies. According to (8.71), the limiting model is characterized
by

$$\theta^* = \arg\min_{\theta}\int_{-\pi}^{\pi}\left|G_0(e^{i\omega}) - G(e^{i\omega},\theta)\right|^2 d\omega \tag{8.80}$$

since $H_*(q) = 1$ and $\Phi_u(\omega) \equiv 1$. Since the amplitude of the true system falls off
by a factor of $10^{-2}$ to $10^{-3}$ for $\omega > 1$, it is clear that errors at higher frequencies
contribute only marginally to the criterion (8.80): hence the good low-frequency fit.
Consider now instead an ARX structure

$$y(t) = \frac{b_1 q^{-1} + b_2 q^{-2}}{1 + a_1 q^{-1} + a_2 q^{-2}}\,u(t) + \frac{1}{1 + a_1 q^{-1} + a_2 q^{-2}}\,e(t) \tag{8.81}$$

corresponding to the linear regression predictor

$$\hat{y}(t|\theta) = -a_1 y(t-1) - a_2 y(t-2) + b_1 u(t-1) + b_2 u(t-2)$$

Figure 8.2 Amplitude Bode plots of true system (thick lines) and model (thin
lines), against frequency (rad/s). (a) OE model estimated in (8.79). (b) ARX model estimated in (8.81).


When applied to the same data, this structure gives the model description in Figure 8.2b, with a much worse low-frequency fit. According to our discussion in this section, this limit model is a compromise between fitting $1/|1 + a_1 e^{i\omega} + a_2 e^{2i\omega}|^2$ to the error spectrum and minimizing ($a_1^*$ and $a_2^*$ correspond to the limit estimate $\theta^*$)

$$\int_{-\pi}^{\pi} \left|G_0(e^{i\omega}) - G(e^{i\omega},\theta)\right|^2 \cdot \left|1 + a_1^* e^{i\omega} + a_2^* e^{2i\omega}\right|^2 d\omega \tag{8.82}$$

The function $|A_*(e^{i\omega})|^2 = |1 + a_1^* e^{i\omega} + a_2^* e^{2i\omega}|^2$ is plotted in Figure 8.3. It assumes values at high frequencies that are $10^4$ times those at low frequencies. Hence, compared to (8.80), the criterion (8.82) penalizes high-frequency misfit much more. This explains the different properties of the limit models obtained in the model structures (8.79) and (8.81), respectively. □
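The amplitude roll-off that Example 8.5 relies on can be checked numerically. The following is a minimal sketch, assuming the coefficients of $G_0(q)$ as printed in (8.78); the function name `gain` and the use of NumPy are illustrative choices, not part of the original text. The pure delay factor is omitted because it has unit magnitude on the unit circle and so does not affect the amplitude.

```python
import numpy as np

# Coefficients of G0(q) from (8.78), written as polynomials in q^{-1}.
# The pure delay factor is left out: it has magnitude 1 on the unit circle.
NUM = 0.001 * np.array([10.0, 7.4, 0.924, 0.1764])
DEN = np.array([1.0, -2.14, 1.553, -0.4387, 0.042])


def gain(omega):
    """Return |G0(e^{i omega})| for a scalar or array of frequencies (rad/sample)."""
    z_inv = np.exp(-1j * np.atleast_1d(omega))
    # np.polyval expects the highest power first, so reverse the q^{-1} coefficients.
    num = np.polyval(NUM[::-1], z_inv)
    den = np.polyval(DEN[::-1], z_inv)
    return np.abs(num / den)


if __name__ == "__main__":
    for w in (0.0, 0.1, 1.0, np.pi):
        print(f"omega = {w:5.3f}   |G0| = {gain(w)[0]:.3e}")
```

Comparing the printed gain at $\omega = 0$ with the gains at $\omega \geq 1$ shows why an unweighted criterion such as (8.80) is dominated by the low-frequency band.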


Figure 8.3 The weighting function $|A_*(e^{i\omega})|^2$ in (8.82), plotted against frequency (rad/s).


8.6 THE CORRELATION APPROACH 


In Section 7.5 we defined the correlation approach to identification, with the special 
cases of PLR and IV methods. The convergence analysis for these methods is quite 
analogous to the prediction-error approach as given in the previous few sections. 
Basic Convergence Result 


Consider the function

$$f_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N} \zeta(t,\theta)\,\varepsilon_F(t,\theta) \tag{8.83}$$

where $\varepsilon_F$ is given by

$$\varepsilon_F(t,\theta) = L(q)\,\varepsilon(t,\theta) \tag{8.84}$$

and the correlation vector $\zeta(t,\theta)$ is obtained by linear filtering of past data:

$$\zeta(t,\theta) = K_y(q,\theta)\,y(t) + K_u(q,\theta)\,u(t) \tag{8.85}$$

(both filters contain a delay). Determining the estimate $\hat\theta_N$ by solving $f_N(\theta, Z^N) = 0$ gives the correlation approach (7.110). We have here specialized the general instruments (7.110b) to a linear case (8.85).

The convergence analysis of (8.83) is entirely analogous to the prediction-error 
case. Thus we have from Theorem 2B.1: 


Lemma 8.4. Let the data set $Z^\infty$ be subject to D1, and let the prediction errors be computed using a uniformly stable linear model structure. Assume that the family of filters

$$\{K_y(q,\theta),\ K_u(q,\theta);\ \theta \in D_\mathcal{M}\}$$

is uniformly stable. Then

$$\sup_{\theta \in D_\mathcal{M}} \left|f_N(\theta, Z^N) - \bar f(\theta)\right| \to 0, \quad \text{w.p. 1 as } N \to \infty \tag{8.86}$$

where

$$\bar f(\theta) = \bar E\,\zeta(t,\theta)\,\varepsilon_F(t,\theta) \tag{8.87}$$

For the estimate $\hat\theta_N$, we thus have the following result.

Theorem 8.6. Let $\hat\theta_N$ be defined by

$$\hat\theta_N = \operatorname*{sol}_{\theta \in D_\mathcal{M}} \left[f_N(\theta, Z^N) = 0\right]$$

Then, subject to the assumptions of Lemma 8.4,

$$\hat\theta_N \to D_{cf}, \quad \text{w.p. 1 as } N \to \infty \tag{8.88}$$

where

$$D_{cf} = \{\theta \mid \theta \in D_\mathcal{M},\ \bar f(\theta) = 0\} \tag{8.89}$$

The theorem is here given for the special choice $\alpha(\varepsilon_F) = \varepsilon_F$ in (7.110). The extension to general $\alpha(\cdot)$ is straightforward.

This convergence result is quite general and also quite natural. The limiting estimate $\theta^* \in D_{cf}$ will be characterized by the property that the filtered prediction errors $\{\varepsilon_F(t,\theta^*)\}$ indeed are uncorrelated with the instruments $\{\zeta(t,\theta^*)\}$. This was also our guideline when selecting $\hat\theta_N$. We shall now characterize $D_{cf}$ in more practical terms for some special cases.


S ∈ M: Consistency (*)

An assumption $\mathcal{S} \in \mathcal{M}$ would in this case be that there exists a value $\theta_0 \in D_\mathcal{M}$ such that $\{\varepsilon(t,\theta_0) = e_0(t)\}$ is a white noise. With $L(q) = 1$, we thus find that $\bar f(\theta_0) = 0$, since $e_0(t)$ is independent of past data and in particular of $\zeta(t,\theta_0)$. Hence, as expected,

$$\theta_0 \in D_{cf} \tag{8.90}$$

Whether this set contains more elements when the data are informative and the model structure globally identifiable at $\theta_0$ is not so easy to analyze in the general case.


G_0 ∈ G: Instrumental-variable Methods

Consider the IV method with instruments

$$\zeta(t,\theta) = K_u(q,\theta)\,u(t) \tag{8.91}$$

The underlying model is

$$A(q)y(t) = B(q)u(t) + v(t)$$

for which the predictor

$$\hat y(t|\theta) = \varphi^T(t)\,\theta$$

is determined as in (4.11) and (4.12).

Under assumption S1, the true system is given by (8.7). If there exists a $\theta_0$ corresponding to $(A_0(q), B_0(q))$ such that

$$G_0(q) = \frac{B_0(q)}{A_0(q)} \qquad [G_0 \in \mathcal{G};\ \text{cf. (8.44)}]$$

we can consequently write (8.7) as

$$y(t) = \frac{B_0(q)}{A_0(q)}\,u(t) + H_0(q)e_0(t) \tag{8.92}$$

or

$$y(t) = \varphi^T(t)\,\theta_0 + w_0(t) \tag{8.93}$$

where

$$w_0(t) = A_0(q)H_0(q)e_0(t) \tag{8.94}$$

Suppose now that the system operates in open loop so that $\{w_0(t)\}$ and $\{u(t)\}$ are independent. Then

$$\bar f(\theta) = \bar E\,\zeta(t,\theta)\,L(q)\left[\varphi^T(t)(\theta_0 - \theta) + w_0(t)\right] = \left[\bar E\,\zeta(t,\theta)\,\varphi_F^T(t)\right](\theta_0 - \theta) \tag{8.95}$$

where

$$\varphi_F(t) = L(q)\,\varphi(t) \tag{8.96}$$
The second equality in (8.95) follows since $\zeta(t,\theta)$ is entirely constructed from past $u(t)$, while $L(q)w_0(t)$ is independent of $\{u(t)\}$. Under the stated assumptions, we thus have that $\theta_0 \in D_{cf}$, and whether this set contains more $\theta$-values depends on whether the matrix $\bar E\,\zeta(t,\theta)\,\varphi_F^T(t)$ is singular.

Suppose now that the instruments $\zeta$ do not depend on $\theta$ and are generated according to (7.122) to (7.124). The matrix

$$R = \bar E\,\zeta(t)\,\varphi_F^T(t) \tag{8.97}$$

is then a constant matrix that depends only on the filters $L(q)$, $K(q)$, $N(q)$, and $M(q)$, on the true system, and on the properties of $\{u(t)\}$. A thorough discussion of the nonsingularity of $R$ is given in Söderström and Stoica (1983). We first note the following facts. Let $n_a^0$, $n_b^0$ be the orders of the true description (8.92), and let $n_a$, $n_b$ be the corresponding model orders. Let the orders of the instrument filters (7.122) to (7.124) be $n_n$ and $n_m$. Then

1. If $\min(n_a - n_a^0,\ n_b - n_b^0) > 0$, then $R$ is singular. (8.98a)
2. If $\min(n_a - n_n,\ n_b - n_m) > 0$, then $R$ is singular. (8.98b)

To see this, let

$$z_0(t) = \frac{B_0(q)}{A_0(q)}\,u(t) \tag{8.99}$$

$$\tilde\varphi(t) = \left[\,-z_0(t-1)\ \ldots\ -z_0(t-n_a)\quad u(t-1)\ \ldots\ u(t-n_b)\,\right]^T$$

Let $\tilde\varphi_F(t) = L(q)\tilde\varphi(t)$. If $n_a > n_a^0$ and $n_b > n_b^0$, then (8.99) implies that there exists an $(n_a + n_b)$-dimensional vector $S$ such that $\tilde\varphi^T(t)S = 0$. Then also $\tilde\varphi_F^T(t)S = 0$. Now, since $\{w_0(t)\}$ and $\{u(t)\}$ are independent, we have

$$R = \bar E\,\zeta(t)\,\varphi_F^T(t) = \bar E\,\zeta(t)\,\tilde\varphi_F^T(t) \tag{8.100}$$

which shows that $RS = 0$ and $R$ is singular. Similarly, (8.98b) implies the existence of a vector $S$ such that $S^T\zeta(t) = 0$.

When neither of (8.98) holds, the matrix $R$ is "generically" nonsingular. To show this, the reasoning goes as follows: For a given true system and a given input, denote the coefficients of the filters $L$, $K$, $N$, and $M$ by $\rho$. The matrix $R$ is thus a function of $\rho$: $R(\rho)$. Now consider the scalar-valued function $\det R(\rho)$. This is an analytic function of $\rho$ (see Finigan and Rowe, 1974). If such a function is zero for $\rho$ in a set of positive Lebesgue measure, then it must be identically zero. If we can find a value $\rho^*$ such that $\det R(\rho^*) \neq 0$, we thus can conclude that $\det R(\rho) \neq 0$ for almost all $\rho$ (in the set where $\det R$ is analytic). Such $\rho^*$ can be found if the input spectrum $\Phi_u(\omega) > 0$ for all $\omega$ and the orders of the filters $N$ and $M$ are chosen at least as large as the corresponding model orders $n_a$ and $n_b$ (see Problem 8T.2). We thus have the following result.


Suppose that the system is given by (8.92), that $\Phi_u(\omega) > 0$, and that $u$ and $e_0$ are independent. Let the instruments $\zeta(t)$ be given by (7.122) to (7.124). Assume that neither of the conditions in (8.98) holds. Then $R$ in (8.97) is nonsingular for almost all such choices of $N$, $M$, $L$, and $K$. (8.101)

Frequency-domain Characterization of $D_{cf}$ for the IV Method (*)

The prediction errors, under assumption S1, can be written

$$\varepsilon(t,\theta) = A(q)y(t) - B(q)u(t) = A(q)\left[G_0(q)u(t) + H_0(q)e_0(t) - \frac{B(q)}{A(q)}\,u(t)\right]$$

using (8.92). With the instruments given by (8.91), we thus have, analogous to (8.63) to (8.71),

$$\bar f(\theta) = \bar E\,\zeta(t,\theta)\,\varepsilon_F(t,\theta) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \left[G_0(e^{i\omega}) - G(e^{i\omega},\theta)\right] \Phi_u(\omega)\,A(e^{i\omega})\,L(e^{i\omega})\,K_u(e^{-i\omega},\theta)\,d\omega \tag{8.102}$$

with

$$G(e^{i\omega},\theta) = \frac{B(e^{i\omega})}{A(e^{i\omega})}$$

Here $K_u(e^{-i\omega},\theta)$ is a $d$-dimensional column vector.

The limiting estimates $\theta^* \in D_{cf}$ are thus characterized by the fact that certain scalar products with the error $G_0(e^{i\omega}) - G(e^{i\omega},\theta^*)$ over the frequency domain are zero.
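The consistency argument for the IV method in this section can be illustrated with a small simulation. The sketch below is not taken from the text: it assumes a first-order system $y(t) + a_0 y(t-1) = b_0 u(t-1) + e_0(t) + c_0 e_0(t-1)$ operating in open loop with white input, instruments $\zeta(t) = [u(t-1)\ u(t-2)]^T$, and $L(q) = 1$; the numerical values, the seed, and the function names `simulate` and `ls_and_iv` are all illustrative assumptions. Because $w_0(t)$ is colored, the least-squares estimate is expected to be biased, while the IV estimate should approach $\theta_0$.

```python
import numpy as np


def simulate(N, a0=-0.8, b0=1.0, c0=0.7, seed=0):
    """Simulate y(t) + a0*y(t-1) = b0*u(t-1) + e(t) + c0*e(t-1) in open loop."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(N)   # white input, independent of the noise
    e = rng.standard_normal(N)   # white innovations
    y = np.zeros(N)
    for t in range(1, N):
        y[t] = -a0 * y[t - 1] + b0 * u[t - 1] + e[t] + c0 * e[t - 1]
    return u, y


def ls_and_iv(u, y):
    """Return (theta_ls, theta_iv) for theta = [a, b] in y(t) = phi(t)^T theta + w(t)."""
    target = y[2:]                                   # y(t), t = 2..N-1
    phi = np.column_stack([-y[1:-1], u[1:-1]])       # [-y(t-1), u(t-1)]
    zeta = np.column_stack([u[1:-1], u[:-2]])        # instruments [u(t-1), u(t-2)]
    theta_ls = np.linalg.solve(phi.T @ phi, phi.T @ target)
    # IV: solve sum zeta(t) * (y(t) - phi(t)^T theta) = 0
    theta_iv = np.linalg.solve(zeta.T @ phi, zeta.T @ target)
    return theta_ls, theta_iv


if __name__ == "__main__":
    u, y = simulate(100_000)
    theta_ls, theta_iv = ls_and_iv(u, y)
    print("true     :", [-0.8, 1.0])
    print("LS       :", theta_ls)
    print("IV       :", theta_iv)
```

In this assumed setup the matrix $\bar E\,\zeta(t)\varphi^T(t)$ has determinant $b_0 \neq 0$, so the nonsingularity condition discussed above is satisfied.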


8.7 SUMMARY 


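In this chapter we have answered the question of what would happen with the estimates if more and more observed data become available. The answer is natural: We have for the prediction-error approach (7.155) that

$$\hat\theta_N \to \arg\min_{\theta \in D_\mathcal{M}} \bar E\,\ell(\varepsilon(t,\theta),\theta), \quad \text{w.p. 1 as } N \to \infty \tag{8.103}$$

and for the correlation approach (7.156) that

$$\hat\theta_N \to \operatorname*{sol}_{\theta \in D_\mathcal{M}} \left[\bar E\,\zeta(t,\theta)\,\alpha(\varepsilon_F(t,\theta)) = 0\right], \quad \text{w.p. 1 as } N \to \infty \tag{8.104}$$

These results were proved in Theorems 8.2 and 8.6, respectively.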

The limiting models are thus the same as those that we could have computed 
as the best system approximations, in case the probabilistic properties of the system 
had been known to begin with. 

In case a true description of the system is available in the model set, the model will converge to this description under certain natural conditions on $\ell$, provided the data set is informative enough. This was shown in Theorems 8.3 to 8.5.

When no exact description can be obtained, the model will be fitted to the system in a way that for linear time-invariant models can be characterized in the frequency domain as follows [see (8.71) to (8.77)]:

The limiting transfer function estimate $G^*(e^{i\omega})$ is partly or entirely determined as the closest function to the true transfer function, measured in a quadratic norm over the frequency range, with a weighting function $\Phi_u(\omega)/|H(e^{i\omega},\theta^*)|^2$, while the resulting noise model $|H(e^{i\omega},\theta^*)|^2$ resembles the output error spectrum $\Phi_{ER}(\omega)$ as much as possible. (8.105)

We have not specifically addressed the convergence and consistency of the subspace method (7.66) for estimating state-space models. That will be done in connection with the algorithmic details in Section 10.6. However, a heuristic convergence analysis follows from the results of this chapter: As $s$ and $N$ increase to infinity ($s$ much slower than $N$) the $k$-step ahead predictors in $Y$ will converge to the true ones. This means that the procedure will extract a correct state vector, which in turn means that the second LS-step will estimate the system matrices $A$, $B$, $C$, $D$ and $K$ consistently.

8.9 PROBLEMS


8G.1 Local minima: Consider the matrix $\Gamma(\theta)$ defined in Problem 4G.4 and recall that the condition

$$\Gamma(\theta) > 0 \tag{8.106a}$$

implies local identifiability of the model structure at $\theta$. Show that this condition together with the condition

$$\Phi_{\chi_0}(\omega) > 0, \quad \forall\omega \tag{8.106b}$$

[$\Phi_{\chi_0}(\omega)$ is the spectrum of $\chi_0(t)$ defined in (8.14)] implies that

$$\bar E\,\psi(t,\theta)\,\psi^T(t,\theta) > 0$$

where $\psi(t,\theta)$ as usual is $(d/d\theta)\hat y(t|\theta)$. Show also that if $\varepsilon(t,\theta_0) = e_0(t)$ is white noise and if

$$E\,\ell'(e_0(t)) = 0, \qquad E\,\ell''(e_0(t)) > 0$$

then the conditions (8.106) (at $\theta = \theta_0$) imply that $\bar V(\theta) = \bar E\,\ell(\varepsilon(t,\theta))$ has a strict local minimum at $\theta = \theta_0$.

8G.2 Suppose that the transfer-function model set $\{G(q,\theta)\}$ consists of $n$th-order linear transfer functions:

$$G(q,\theta) = \frac{b_1 q^{-1} + \cdots + b_n q^{-n}}{1 + a_1 q^{-1} + \cdots + a_n q^{-n}}$$

and suppose that the input consists of $n$ sinusoids of different frequencies, $\omega_i$, $i = 1, \ldots, n$, $0 < \omega_i < \pi$.

(a) Suppose that the noise model is independently parametrized. Show that the limit model $G^*(e^{i\omega})$ fits exactly to the true system $G_0(e^{i\omega})$ at the frequencies in question. It is consequently the same result as if we applied frequency analysis with these input frequencies.

(b) Suppose that the noise model has parameters in common with $G(q,\theta)$, but that the system is noise-free: $\Phi_v(\omega) \equiv 0$. Show that the result under part (a) still holds.

8G.3 Consistency of the maximum likelihood method: Suppose that the conditions of Theorem 8.5, except (8.52) and (8.53), hold. Let

$$\ell(x) = -\log f_e(x)$$

where $f_e(x)$ is the PDF of $e_0(t)$. Show that (8.52) holds and that the theorem holds even if (8.53) is not satisfied.

8G.4 Suppose that the system is controlled by output feedback: $u(t) = r(t) - F(q)y(t)$. Show that $\bar V(\theta)$ in (8.63) then also can be written


IG(el”. 8) — Gole')|- 

Via) = e 
vo = | enp o 
| Glet 0) Fei” 1? 1 
f 1 + Gle”. OF (el) | D, (wdw 


1 + Golei2)F (eie) è |Htei®, a) 
where $\Phi_u^r$ is defined by (8.74). Suppose that we have parameterized $G$, so that there is a $\theta_0$ for which $G(q,\theta_0) = G_0(q)$, and that no noise model is of interest. Use the above expression for $\bar V(\theta)$ to suggest a parameterization of $H(q,\theta)$, such that $\hat\theta_N$ will converge to $\theta_0$ even though no correct noise model is obtained.


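8G.5 Let $\{u, y\}$ be an arbitrary, quasistationary input-output data set, and consider the output error linear model with prediction errors $\varepsilon(t,\theta) = y(t) - G(q,\theta)u(t)$. Assume that $G$ starts with a delay.

(a) Show that

$$\bar E\,\varepsilon^2(t,\theta) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \left|\frac{\Phi_{yu}(\omega)}{\Phi_u(\omega)} - G(e^{i\omega},\theta)\right|^2 \Phi_u(\omega)\,d\omega + \theta\text{-independent terms}$$

(b) Let $\Phi_u$ be factorized as $\Phi_u(\omega) = L(e^{i\omega})L(e^{-i\omega})$ for $L$ such that both $L$ and $L^{-1}$ are stable and causal. (That $L$ is causal means that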


oc 
Lie”) = ye Gee 
k=1 


Now define, 


Rel”) = a = ne” 


5 ree -iw + Donets A Ra (e) +R (e”) 


=—x< 


where we have split R into an anti-causal and a causal part. Define Gole”) = 
R,(e'®)L(et”), and show that / 


= 1 f7 i ; 2 , 
Ee7(t,@) = =| IGal?) — G(e'®, 0) D, (w)dw + 6—independent terms 
-īa 


(c) Show that if a fixed noise model H, is used. the corresponding expression becomes 


Pu (œ) | l H,(e'®) 
L(e”) Heet) Jansa Llet) 


Gole?) = l 


where [-]causa Means the causal part. Show also that e(t) à HI'N - 
Go(q)u(t)) will be uncorrelated with past inputs. for the Go thus defined. (Hint: 
Derive the cross spectrum between u and e and show that is an anti-causal func- 
tion.) 


8G.6 Consider (8.75) and assume that both Go(e'’) and G(e'”. @) are causal. Show that the 
“bias-pull”-function B(e’®, 8) can be replaced by 


, H ioy _ H iw g Due H iw 
Gele”. 0) = $ ole) (e'’, 0)) 2] | Hte'’. 6) 
causal 


Lite) H(e@, 8) Lie”) 


with notation as in the previous problem. 


8E.1 Apply Theorem 8.2 to the LS criterion (7.33) and verify the heuristically derived result (7.38).

8E.2 Consider Problem 7E.4. Here the criterion function $\ell(\varepsilon(t,\theta),\theta)$ is not parametrized independently from the parametrization of the predictor. Suppose that the true system is given by

$$x(t+1) = a_0 x(t) + w_0(t)$$
$$y(t) = x(t) + v_0(t)$$

where $\{w_0(t)\}$ and $\{v_0(t)\}$ are independent, white Gaussian noises with variances $E w_0^2(t) = 0.1$ and $E v_0^2(t) = 10$, respectively.

(a) Show that there exists a value $\theta^*$ in the parametrization (7.163) such that

$$\varepsilon(t,\theta^*) = e_0(t) = \text{the true innovations}$$

(b) Show that the maximum likelihood estimate $\hat\theta_N^{ML}$ does not converge to $\theta^*$ as $N$ tends to infinity.

(c) Explain the paradox.

8E.3 Consider the output error model structure

$$\hat y(t|\theta) = \frac{b}{1 + a q^{-1}}\,u(t-1), \qquad \theta = \begin{bmatrix} a \\ b \end{bmatrix}$$

Suppose the true system is as in Example 8.1:

$$y(t) + a_0 y(t-1) = b_0 u(t-1) + e_0(t) + c_0 e_0(t-1)$$

and that the input is generated as

$$u(t) = -k_0 y(t) + r(t)$$

where $\{r(t)\}$ is white noise and

$$|a_0 + k_0 b_0| < 1$$

Give an expression that characterizes the limit of $\hat\theta_N$. Will it be equal to $[a_0\ \ b_0]^T$? On what point are the assumptions of Theorem 8.4 violated?

8E.4 Consider the model structure

$$y(t) + a y(t-1) = b u(t-1) + e(t), \qquad \theta = \begin{bmatrix} a \\ b \end{bmatrix}$$

Suppose that the true system and input are given as in Problem 8E.3. Let $\hat\theta_N$ be the IV estimate of $\theta$ with instruments

$$\zeta(t) = \begin{bmatrix} u(t-1) \\ u(t-2) \end{bmatrix}$$

Does $\hat\theta_N$ converge to $\theta_0 = \begin{bmatrix} a_0 \\ b_0 \end{bmatrix}$?
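8E.5 Consider the expression for $\Phi_\varepsilon$ in (8.69). Suppose that the feedback can be described as

$$u(t) = K(q)e_0(t) + r(t)$$

Show that the error spectrum can then be written as

$$\Phi_\varepsilon(\omega,\theta) = \frac{\left|G_0(e^{i\omega}) - G(e^{i\omega},\theta)\right|^2 \Phi_r(\omega) + \left|\left[G_0(e^{i\omega}) - G(e^{i\omega},\theta)\right]K(e^{i\omega}) + H_0(e^{i\omega}) - H(e^{i\omega},\theta)\right|^2 \lambda_0}{|H(e^{i\omega},\theta)|^2} + \lambda_0$$

8T.1 Consider a quasi-stationary sequence $\{u(t)\}$ and a uniformly stable family of filters $\{G(q,\theta),\ \theta \in D_\mathcal{M}\}$. Let

$$z(t,\theta) = G(q,\theta)u(t)$$

Use the corollary to Theorem 2.2 (Appendix 2A) to note: For each $\theta \in D_\mathcal{M}$, $\{z(t,\theta)\}$ is a quasi-stationary sequence and

$$\sup_{\theta \in D_\mathcal{M}} \left|\frac{1}{N}\sum_{t=1}^{N} z^2(t,\theta) - R_z(\theta)\right| \to 0, \quad \text{as } N \to \infty$$

Here

$$R_z(\theta) = \bar E\,z^2(t,\theta) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \left|G(e^{i\omega},\theta)\right|^2 \Phi_u(\omega)\,d\omega$$

Use this result to give a probabilistic-free counterpart of Theorem 8.2: For any quasi-stationary deterministic sequences $\{y(t)\}$ and $\{u(t)\}$,

$$\hat\theta_N \to \arg\min_{\theta \in D_\mathcal{M}} \bar E\,\varepsilon^2(t,\theta)$$

where

$$\bar E\,\varepsilon^2(t,\theta) = \lim_{N \to \infty} \frac{1}{N}\sum_{t=1}^{N} \varepsilon^2(t,\theta)$$

See Ljung (1985d) for a related discussion.

8T.2 Consider the situation of (8.101). Let the true system be given by (8.92) and suppose that the filter choices are as follows:

$$L(q) = K(q) = 1$$
$$N(q) = A_0(q)$$
$$M(q) = B_0(q)$$

Suppose $\Phi_u(\omega) > 0$ for all $\omega$. Show that $R(\rho)$ is positive definite (and hence nonsingular) for this choice of filters.

8T.3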


Let the system be given by

$$\mathcal{S}: \quad y(t) = G_0(q)u(t) + H_0(q)e_0(t)$$

and an underlying model by

$$\mathcal{M}: \quad y(t) = G(q,\theta)u(t) + H(q,\theta)e(t) \tag{8.107}$$

Suppose that

$$D_T(\mathcal{S},\mathcal{M}) = \{\theta_0\}$$

and that the data are informative. Now let

$$\hat y(t|t-k;\theta)$$

be the $k$-step-ahead predictor computed from (8.107), and let $\hat\theta_N$ be determined by

$$\hat\theta_N = \arg\min_{\theta} \frac{1}{N}\sum_{t=1}^{N} \left|y(t) - \hat y(t|t-k;\theta)\right|^2$$

Show that $\hat\theta_N \to \theta_0$ as $N \to \infty$, provided there is no output feedback in the generation of the input. What happens if there is feedback? What happens if there is feedback, but there is a $k$-step time delay between input and output? (Hint: Note that the $k$-step-ahead predictor is a special case of the general linear model so that Theorem 8.2 is applicable. Try to copy the technique of Theorem 8.3.)

8D.1 Show that if

$$\sup_{-1 \le x \le 1} |f_N(x) - f(x)| \to 0 \quad \text{as } N \to \infty$$

and

$$x_N = \arg\min_{-1 \le x \le 1} f_N(x)$$

then

$$x_N \to \arg\min_{-1 \le x \le 1} f(x)$$

8D.2 Show that (8.20) follows from (8.18), (8.2), and (8.5).

8D.3 To show that it is not sufficient to require $\ell(x)$ to be increasing with $|x|$ in Theorem 8.5, consider the following counterexample:


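$$\ell(x) = \begin{cases} 0, & x = 0 \\ 2|x|, & 0 < |x| < \tfrac{1}{2} \\ 1, & |x| \ge \tfrac{1}{2} \end{cases}$$

$$\varepsilon(t,\theta_0) = e_0(t) = \begin{cases} +1, & \text{with probability } 1/2 \\ -1, & \text{with probability } 1/2 \end{cases}$$

Suppose there exists a value $\theta$ such that $\hat y(t|\theta) - \hat y(t|\theta_0)$ has the following distribution:

$$\begin{cases} +1, & \text{with probability } 1/2 \\ -1, & \text{with probability } 1/2 \end{cases}$$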

Check that
(a) Condition (8.52) of Theorem 8.5 is satisfied, but not (8.53).
(b) $\bar E\,\varepsilon^2(t,\theta) > \bar E\,\varepsilon^2(t,\theta_0)$.
(c) $\bar E\,\ell(\varepsilon(t,\theta)) < \bar E\,\ell(\varepsilon(t,\theta_0))$, so that $\hat\theta_N \not\to \theta_0$.

ASYMPTOTIC DISTRIBUTION
OF PARAMETER ESTIMATES


9.1 INTRODUCTION

Once we have understood the convergence properties of the estimate $\hat\theta_N$, the question arises as to how fast the estimate approaches this limit. If $\theta^*$ denotes the limit, we are thus interested in the variable $\hat\theta_N - \theta^*$. We shall impose the same stochastic framework on the estimation problem as in the previous chapter. This means that $\hat\theta_N - \theta^*$ will be a random variable, and its "size" can be characterized by its covariance matrix or, more completely, by its probability distribution. It would be a difficult task to compute the distribution in the general case for any $N$, and we will have to be content with asymptotic expressions for large $N$. It will turn out that $(\hat\theta_N - \theta^*)$ typically decays as $1/\sqrt{N}$, so the distribution of the random variable

$$\sqrt{N}\,(\hat\theta_N - \theta^*)$$

will then converge to a well-defined distribution, which will turn out to be Gaussian under weak assumptions.

In this chapter we shall derive expressions for these asymptotic distributions. 
The chapter is structured analogously to Chapter 8. Thus, in Section 9.2 we give the 
basic result for prediction-error methods. Section 9.3 gives explicit expressions for 
the asymptotic covariance matrices in cases where the limit $\theta^*$ gives a correct description of the true transfer function. Frequency-domain expressions for the variances, as well as the variance of the resulting transfer function, are derived in Section 9.4. The correlation approach to parameter estimation is treated in Section 9.5, and Section 9.6 deals with the practical use of the results derived in the chapter. The reader might find it useful to first review the analysis in Appendix II, Section II.2, which could serve as a simplified preview of the techniques and results of the present chapter.

9.2 THE PREDICTION-ERROR APPROACH: BASIC THEOREM


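Heuristic Analysis

Consider, as before,

$$\hat\theta_N = \arg\min_{\theta \in D_\mathcal{M}} V_N(\theta, Z^N) \tag{9.1}$$

$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N} \frac{1}{2}\,\varepsilon^2(t,\theta) \tag{9.2}$$

Then, with prime denoting differentiation with respect to $\theta$,

$$V_N'(\hat\theta_N, Z^N) = 0$$

Suppose that the set $D_c$ in (8.23) consists of only one point, $\theta^*$. Expanding the preceding expression into Taylor series around $\theta^*$ gives

$$0 = V_N'(\theta^*, Z^N) + V_N''(\xi_N, Z^N)(\hat\theta_N - \theta^*) \tag{9.3}$$

where $\xi_N$ is a value "between" $\hat\theta_N$ and $\theta^*$. We know that $\hat\theta_N \to \theta^*$ w.p. 1. By arguments analogous to Lemma 8.2, it should be possible to show that $V_N''(\theta, Z^N)$ converges uniformly in $\theta$ to $\bar V''(\theta)$. Then

$$V_N''(\xi_N, Z^N) \to \bar V''(\theta^*), \quad \text{as } N \to \infty \text{ w.p. 1} \tag{9.4}$$

Provided this matrix ($d \times d$) is invertible, (9.3) suggests that for large $N$ the difference is given by

$$(\hat\theta_N - \theta^*) \approx -\left[\bar V''(\theta^*)\right]^{-1} V_N'(\theta^*, Z^N) \tag{9.5}$$

The second factor is given by

$$-V_N'(\theta^*, Z^N) = \frac{1}{N}\sum_{t=1}^{N} \psi(t,\theta^*)\,\varepsilon(t,\theta^*) \tag{9.6}$$

where, as usual,

$$\psi(t,\theta^*) = -\frac{d}{d\theta}\,\varepsilon(t,\theta)\Big|_{\theta=\theta^*} = \frac{d}{d\theta}\,\hat y(t|\theta)\Big|_{\theta=\theta^*}$$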
is a $d$-dimensional column vector. By definition

$$\bar V'(\theta^*) = -\bar E\,\psi(t,\theta^*)\,\varepsilon(t,\theta^*) = 0$$

Apart from the difference

$$D_N = E\,\frac{1}{N}\sum_{t=1}^{N}\left[\psi(t,\theta^*)\,\varepsilon(t,\theta^*) - \bar E\,\psi(t,\theta^*)\,\varepsilon(t,\theta^*)\right] \tag{9.7}$$

which we assume to tend to zero "quickly," the expression (9.6) is thus a sum of random variables $\psi(t,\theta^*)\varepsilon(t,\theta^*)$ with zero mean values. Had they been independent, it would have been a direct consequence of the central limit theorem that

$$\frac{1}{\sqrt{N}}\sum_{t=1}^{N} \psi(t,\theta^*)\,\varepsilon(t,\theta^*) \in \mathrm{AsN}(0, Q) \tag{9.8}$$

with

$$Q = \lim_{N \to \infty} N \cdot E\left\{\left[V_N'(\theta^*, Z^N)\right]\left[V_N'(\theta^*, Z^N)\right]^T\right\} \tag{9.9}$$

Here (9.8) means that the random variable on the left converges in distribution to the normal distribution with zero mean and covariance matrix $Q$ [see (I.17)]. The terms of $V_N'$ are not independent, but with assumptions D1 and (8.5) the dependence between distant terms will decrease. It seems reasonable that (9.8) will still hold.

If (9.8) holds, we have directly from (9.5) that

$$\sqrt{N}\,(\hat\theta_N - \theta^*) \in \mathrm{AsN}(0, P_\theta) \tag{9.10}$$

$$P_\theta = \left[\bar V''(\theta^*)\right]^{-1} Q \left[\bar V''(\theta^*)\right]^{-1} \tag{9.11}$$
Asymptotic Normality

The preceding heuristic analysis can be rigorously justified as shown in Appendix 9A. The result is summarized as follows.

Theorem 9.1. Consider the estimate $\hat\theta_N$ determined by (9.1) and (9.2). Assume that the model structure is linear and uniformly stable (see Definitions 4.3 and 8.3) and that the data set $Z^\infty$ is subject to D1 (see Section 8.2). Assume also that for a unique value $\theta^*$ interior to $D_\mathcal{M}$ we have

$$\hat\theta_N \to \theta^*, \quad \text{w.p. 1 as } N \to \infty$$
$$\bar V''(\theta^*) > 0$$

and that

$$\sqrt{N}\,D_N \to 0, \quad \text{as } N \to \infty$$

with $D_N$ defined by (9.7). Then

$$\sqrt{N}\,(\hat\theta_N - \theta^*) \in \mathrm{AsN}(0, P_\theta)$$
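where $P_\theta$ is given by (9.11) and (9.9).

With more technical effort the same result can also be shown under relaxed assumptions on the data set and the model structure (see, e.g., Ljung and Caines, 1979). Notice in particular that the result holds, without changes, for general norms $\ell(\varepsilon,\theta,t)$, provided $\ell$ is sufficiently smooth in $\varepsilon$ and $\theta$. The calculations of the derivatives $V_N'$ and $\bar V''$ become more laborious, though. For

$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N} \ell(\varepsilon(t,\theta),\theta,t) \tag{9.12}$$

we have

$$V_N'(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N}\left[-\psi(t,\theta)\,\ell_\varepsilon'(\varepsilon(t,\theta),\theta,t) + \ell_\theta'(\varepsilon(t,\theta),\theta,t)\right] \tag{9.13}$$

where $\ell_\varepsilon'$ is $(\partial/\partial\varepsilon)\ell(\varepsilon,\theta,t)$ and $\ell_\theta'$ is the $d \times 1$ vector $(\partial/\partial\theta)\ell(\varepsilon,\theta,t)$. Similarly,

$$\begin{aligned}
\bar V''(\theta) ={}& \bar E\,\psi(t,\theta)\,\ell_{\varepsilon\varepsilon}''(\varepsilon(t,\theta),\theta,t)\,\psi^T(t,\theta) \\
&- \bar E\,\frac{\partial}{\partial\theta}\psi(t,\theta)\,\ell_\varepsilon'(\varepsilon(t,\theta),\theta,t) - \bar E\,\psi(t,\theta)\,\ell_{\varepsilon\theta}''(\varepsilon(t,\theta),\theta,t) \\
&- \bar E\,\ell_{\theta\varepsilon}''(\varepsilon(t,\theta),\theta,t)\,\psi^T(t,\theta) + \bar E\,\ell_{\theta\theta}''(\varepsilon(t,\theta),\theta,t)
\end{aligned} \tag{9.14}$$

with obvious notation for the second-order derivatives. We shall allow ourselves to use the asymptotic normality result in this more general form, whenever required.

The matrix $P_\theta$ in (9.11) is thus the covariance matrix of the asymptotic distribution. When our main interest is in the covariance, we shall write

$$\operatorname{Cov}\hat\theta_N \sim \frac{1}{N}\,P_\theta \tag{9.15}$$

In Appendix 9B a formal verification of (9.15) is given.

The expression for the asymptotic covariance matrix $P_\theta$ via (9.9), (9.11), (9.13), and (9.14) is rather complicated in general. In the next section we shall evaluate it in some special cases.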


9.3 EXPRESSIONS FOR THE ASYMPTOTIC VARIANCE 


In this section we shall consider the asymptotic covariance matrix for prediction-error estimates in case, essentially, a true description of the system is available in the model set.

S ∈ M: Quadratic Criterion

Assume that the conditions of Theorem 8.3 hold. Then $D_c = \{\theta_0\}$ and $\varepsilon(t,\theta_0) = e_0(t)$ is a sequence of independent random variables with zero mean values and variances $\lambda_0$. From (9.6) and (9.9) we then have

$$\begin{aligned}
Q &= \lim_{N \to \infty} \frac{1}{N}\sum_{t=1}^{N}\sum_{s=1}^{N} E\,\psi(t,\theta_0)\,e_0(t)\,e_0(s)\,\psi^T(s,\theta_0) \\
&= \lambda_0 \lim_{N \to \infty} \frac{1}{N}\sum_{t=1}^{N} E\,\psi(t,\theta_0)\,\psi^T(t,\theta_0) = \lambda_0\,\bar E\,\psi(t,\theta_0)\,\psi^T(t,\theta_0)
\end{aligned} \tag{9.16}$$

since $\{e_0(t)\}$ is white (see also Problem 9D.1). Similarly,

$$\bar V''(\theta_0) = \bar E\,\psi(t,\theta_0)\,\psi^T(t,\theta_0) - \bar E\,\psi'(t,\theta_0)\,e_0(t) = \bar E\,\psi(t,\theta_0)\,\psi^T(t,\theta_0)$$

The last equality follows since $\psi'(t,\theta_0)$ is formed from $Z^{t-1}$ only. Hence, from (9.11), we obtain

$$P_\theta = \lambda_0\left[\bar E\,\psi(t,\theta_0)\,\psi^T(t,\theta_0)\right]^{-1} \tag{9.17}$$

This result has a natural interpretation. Since $\psi$ is the gradient of $\hat y$, we see that the asymptotic accuracy of a certain parameter is related to how sensitive the prediction $\hat y(t|\theta)$ is with respect to this parameter. Clearly, the more a parameter affects the prediction, the easier it will be to determine its value. A very important and useful aspect of expressions for the asymptotic covariance matrix like (9.17) is that it can be estimated from data. Having processed $N$ data points and determined $\hat\theta_N$, we may use

$$\hat P_N = \hat\lambda_N\left[\frac{1}{N}\sum_{t=1}^{N} \psi(t,\hat\theta_N)\,\psi^T(t,\hat\theta_N)\right]^{-1} \tag{9.18}$$

$$\hat\lambda_N = \frac{1}{N}\sum_{t=1}^{N} \varepsilon^2(t,\hat\theta_N) \tag{9.19}$$

as an estimate of $P_\theta$. See Problem 9G.3 for the case where $\mathcal{S}$ "almost" belongs to $\mathcal{M}$.

Note also that according to Problem 9G.5, the estimate $\hat\lambda_N$ will be asymptotically uncorrelated with $\hat\theta_N$.

Example 9.1 Covariance of LS Estimates

Consider the system

$$y(t) + a_0 y(t-1) = u(t-1) + e_0(t) \tag{9.20}$$

The input $\{u(t)\}$ is white noise with variance $\mu$, independent of $\{e_0(t)\}$ which is white noise with variance $\lambda_0$. We suppose that the coefficient for $u(t-1)$ is known and the system is identified in the model structure

$$\mathcal{M}: \quad y(t) + a y(t-1) = u(t-1) + e(t), \qquad \theta = a \tag{9.21}$$

or

$$\hat y(t|\theta) = -a y(t-1) + u(t-1)$$

using a quadratic prediction-error criterion. We thus obtain, in this case, the LS estimate, with

$$\psi(t,\theta) = -y(t-1) \tag{9.22}$$

Hence

$$\operatorname{Cov}\hat a_N \sim \frac{\lambda_0}{N}\cdot\frac{1}{\bar E\,y^2(t-1)} \tag{9.23}$$

To compute the covariances, square (9.20) and take expectation. This gives

$$R_y(0) + a_0^2 R_y(0) + 2a_0 R_y(1) = \mu + \lambda_0$$

with the standard notation (2.61). (We used (8.27) for the evaluation of $\bar E$.) Multiplying (9.20) by $y(t-1)$ and taking expectation gives

$$R_y(1) + a_0 R_y(0) = R_{yu}(0) = 0$$

where the last equality follows, since $u(t)$ does not affect $y(t)$ (due to the time delay). Hence

$$R_y(0) = \bar E\,y^2(t-1) = \frac{\mu + \lambda_0}{1 - a_0^2} \tag{9.24}$$

and

$$\operatorname{Cov}\hat a_N \sim \frac{\lambda_0}{N}\cdot\frac{1 - a_0^2}{\mu + \lambda_0} \tag{9.25}$$
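The result (9.25) of Example 9.1 lends itself to a Monte Carlo check. The sketch below is illustrative only: the values of $a_0$, $\mu$, $\lambda_0$, the sample size, the number of runs, the seed, and the function name `ls_variance_check` are assumptions made for the demonstration. It simulates (9.20) many times, computes the LS estimate of $a$ in each run, and compares $N$ times the empirical variance with $\lambda_0(1 - a_0^2)/(\mu + \lambda_0)$. It also evaluates the data-based estimate (9.18) and (9.19) for a single run.

```python
import numpy as np


def ls_variance_check(a0=-0.7, mu=1.0, lam0=1.0, N=500, runs=400, seed=1):
    """Monte Carlo check of (9.25) for y(t) + a0*y(t-1) = u(t-1) + e0(t)."""
    rng = np.random.default_rng(seed)
    u = np.sqrt(mu) * rng.standard_normal((runs, N))
    e = np.sqrt(lam0) * rng.standard_normal((runs, N))
    y = np.zeros((runs, N))
    for t in range(1, N):                      # all runs advance together
        y[:, t] = -a0 * y[:, t - 1] + u[:, t - 1] + e[:, t]

    ylag = y[:, :-1]                           # y(t-1)
    target = y[:, 1:] - u[:, :-1]              # y(t) - u(t-1) = -a*y(t-1) + e(t)
    a_hat = -np.sum(ylag * target, axis=1) / np.sum(ylag ** 2, axis=1)

    empirical = N * np.var(a_hat)              # N * Cov(a_hat) over the runs
    theoretical = lam0 * (1.0 - a0 ** 2) / (mu + lam0)

    # Data-based estimate (9.18)-(9.19) from the first run only; psi = -y(t-1).
    resid = target[0] + a_hat[0] * ylag[0]
    lam_hat = np.mean(resid ** 2)
    p_hat = lam_hat / np.mean(ylag[0] ** 2)
    return a_hat.mean(), empirical, theoretical, p_hat


if __name__ == "__main__":
    mean_a, emp, theo, p_hat = ls_variance_check()
    print(f"mean estimate       : {mean_a:.4f}")
    print(f"N * empirical var   : {emp:.4f}")
    print(f"asymptotic (9.25)*N : {theo:.4f}")
    print(f"single-run P_hat    : {p_hat:.4f}")
```

The single-run value `p_hat` is itself random, so it should only be expected to agree with the asymptotic value to within sampling error.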
Example 9.2 Covariance of an MA(1) Parameter

Consider the system

$$y(t) = e_0(t) + c_0 e_0(t-1)$$

where $\{e_0(t)\}$ is white noise with variance $\lambda_0$. The MA(1) model structure

$$y(t) = e(t) + c\,e(t-1)$$

is used, giving the predictor (4.18):

$$\hat y(t|c) + c\,\hat y(t-1|c) = c\,y(t-1) \tag{9.26}$$

Differentiation w.r.t. $c$ gives

$$\psi(t,c) + c\,\psi(t-1,c) = -\hat y(t-1|c) + y(t-1)$$

At $c = c_0$, we have

$$\psi(t,c_0) = \frac{1}{1 + c_0 q^{-1}}\,e_0(t-1) \tag{9.27}$$

If $\hat c_N$ is the PEM estimate of $c$, we have, according to (9.17),

$$\operatorname{Cov}\hat c_N \sim \frac{\lambda_0}{N}\cdot\frac{1}{\bar E\,\psi^2(t,c_0)} = \frac{1 - c_0^2}{N} \tag{9.28}$$
S ∈ M: θ-independent General Norm ℓ(ε)

Consider now the general criterion (9.12), with $\ell(\varepsilon,\theta,t) = \ell(\varepsilon)$ (no explicit $\theta$- and $t$-dependence). Assume that $E\,\ell'(e_0(t)) = 0$. Then, under the assumption that $\mathcal{S} \in \mathcal{M}$, we find after straightforward calculations from (9.13) and (9.14) that

$$P_\theta = \kappa(\ell)\left[\bar E\,\psi(t,\theta_0)\,\psi^T(t,\theta_0)\right]^{-1} \tag{9.29}$$

$$\kappa(\ell) = \frac{E\left[\ell'(e_0(t))\right]^2}{\left[E\,\ell''(e_0(t))\right]^2} \tag{9.30}$$

Here $\ell'$ and $\ell''$ are the first and second derivatives of $\ell$ with respect to its argument.

Clearly, for $\ell(\varepsilon) = \tfrac{1}{2}\varepsilon^2$, $\ell'(\varepsilon) = \varepsilon$, and $\ell''(\varepsilon) = 1$, so $\kappa(\ell) = E\,e_0^2(t) = \lambda_0$ for quadratic $\ell$. This confirms (9.17) as a special case of (9.29) and (9.30). It is interesting to note that the choice of $\ell$ in the criterion only acts as scaling of the covariance matrix.

Asymptotic Cramér-Rao Bound

The Cramér-Rao bound (7.91) applies to all unbiased estimators and all $N$. It can be rewritten

$$\operatorname{Cov}\left(\sqrt{N}\,(\hat\theta_N - \theta_0)\right) \ge \kappa_0\left[\frac{1}{N}\sum_{t=1}^{N} E\,\psi(t,\theta_0)\,\psi^T(t,\theta_0)\right]^{-1} \tag{9.31}$$

With $f_e(\cdot)$ denoting the PDF of $e_0(t)$, it can be verified that

$$\kappa(-\log f_e) = \kappa_0$$

where $\kappa_0$ is defined by (7.88) and the function $\kappa(\ell)$ by (9.30). We thus find that the asymptotic covariance matrix $P_\theta$ in (9.29) equals the limit (as $N \to \infty$) of the Cramér-Rao bound if $\ell(\cdot)$ is chosen as $-\log f_e(\cdot)$. In this sense the prediction-error estimate (= the maximum likelihood estimate for this choice of $\ell$) has the best possible asymptotic properties one can hope for.
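The scaling factor $\kappa(\ell)$ in (9.30) is easy to evaluate by sampling. The following sketch assumes Gaussian innovations with $\lambda_0 = 1$, for which $-\log f_e$ is quadratic up to constants, and compares the quadratic norm with the norm $\ell(\varepsilon) = \log\cosh\varepsilon$. The log-cosh norm, the sample size, the seed, and the helper name `kappa` are illustrative choices that do not appear in the text. In line with the Cramér-Rao discussion above, the quadratic norm should give $\kappa \approx \lambda_0$ and the other norm a somewhat larger value.

```python
import numpy as np


def kappa(l_prime, l_second, samples):
    """Sample version of (9.30): E[l'(e0)^2] / (E[l''(e0)])^2."""
    return np.mean(l_prime(samples) ** 2) / np.mean(l_second(samples)) ** 2


rng = np.random.default_rng(0)
e0 = rng.standard_normal(200_000)            # Gaussian innovations, lambda_0 = 1

# Quadratic norm l(e) = e^2 / 2: l' = e, l'' = 1.
kappa_quadratic = kappa(lambda e: e, lambda e: np.ones_like(e), e0)

# Log-cosh norm l(e) = log cosh(e): l' = tanh(e), l'' = 1 - tanh(e)^2.
# tanh is odd and the noise is symmetric, so E l'(e0) = 0 as required.
kappa_logcosh = kappa(np.tanh, lambda e: 1.0 - np.tanh(e) ** 2, e0)

if __name__ == "__main__":
    print(f"kappa, quadratic norm: {kappa_quadratic:.4f}")
    print(f"kappa, log-cosh norm : {kappa_logcosh:.4f}")
```

For heavier-tailed innovations the ordering can reverse, which is one motivation for considering non-quadratic norms in the first place.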


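G_0 ∈ G: Quadratic Criterion, G and H Independently Parametrized (*)

We now turn to the case of Theorem 8.4, where $G$ and $H$ are independently parametrized and $G_0$ can be exactly described in the model set, but not necessarily so for $H_0$. Assume that all the conditions of Theorem 8.4 hold, and assume in addition that

$$D_G(\mathcal{S},\mathcal{M}) = \{\rho_0\} \tag{9.32}$$

and

$$D_c = \{\theta^*\}, \qquad \theta^* = \begin{bmatrix} \rho_0 \\ \eta^* \end{bmatrix} \tag{9.33}$$

Let

$$F(q) = \frac{H_0(q)}{H(q,\eta^*)} = \sum_{i=0}^{\infty} f_i q^{-i} \tag{9.34}$$

where $H(q,\eta^*)$ is the limiting noise model and $H_0(q)$ the true one according to (8.7). We then have

$$\varepsilon(t,\theta^*) = H^{-1}(q,\eta^*)\left[y(t) - G(q,\rho_0)u(t)\right] = F(q)e_0(t) \tag{9.35}$$

and

$$\psi(t,\theta^*) = -\frac{d}{d\theta}\,\varepsilon(t,\theta)\Big|_{\theta=\theta^*} = \begin{bmatrix} H^{-1}(q,\eta^*)\,G_\rho'(q,\rho_0)\,u(t) \\ H^{-1}(q,\eta^*)\,H_\eta'(q,\eta^*)\,F(q)e_0(t) \end{bmatrix} \tag{9.36}$$

Moreover, with $\ell(\varepsilon) = \tfrac{1}{2}\varepsilon^2$,

$$\bar V''(\theta^*) = \bar E\,\psi(t,\theta^*)\,\psi^T(t,\theta^*) - \bar E\left[\psi'(t,\theta^*)\,F(q)e_0(t)\right] \tag{9.37}$$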


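Since $\{u(t)\}$ and $\{e_0(t)\}$ are independent, the off-diagonal blocks of the matrix (9.37) will be zero (the blocks corresponding to the decomposition of $\theta$ into $\rho$ and $\eta$). Similarly, $Q$ in (9.9) will be block diagonal and hence so will $P_\theta$. We now concentrate on the interesting block corresponding to $\rho$. Denote

$$\psi_\rho(t) = H^{-1}(q,\eta^*)\,G_\rho'(q,\rho_0)\,u(t) \tag{9.38}$$

which is a deterministic ($e_0$-independent) quantity. Here $G_\rho'$ is the gradient of $G$ w.r.t. $\rho$. Then

$$\begin{aligned}
Q_\rho &= \lim_{N \to \infty} \frac{1}{N}\sum_{t=1}^{N}\sum_{s=1}^{N} \psi_\rho(t)\,\psi_\rho^T(s)\,E\left[F(q)e_0(t) \cdot F(q)e_0(s)\right] \\
&= \lim_{N \to \infty} \frac{1}{N}\sum_{t=1}^{N}\sum_{s=1}^{N}\sum_{i=0}^{\infty}\sum_{j=0}^{\infty} \psi_\rho(t)\,f_i f_j\,\psi_\rho^T(s)\,E\,e_0(t-i)\,e_0(s-j) \\
&= [\,t - i = s - j = \tau\,] = \lim_{N \to \infty} \lambda_0\,\frac{1}{N}\sum_{\tau=1}^{N}\sum_{i=0}^{N-\tau}\sum_{j=0}^{N-\tau} f_i\,\psi_\rho(\tau + i)\,f_j\,\psi_\rho^T(\tau + j)
\end{aligned}$$

Define

$$\tilde\psi_\rho(t) = \sum_{i=0}^{\infty} f_i\,\psi_\rho(t + i) \tag{9.39}$$

Then

$$Q_\rho = \lambda_0\,\bar E\,\tilde\psi_\rho(t)\,\tilde\psi_\rho^T(t) \tag{9.40}$$

Similarly, for the upper left block of (9.37),

$$\left[\bar V''(\theta^*)\right]_\rho = \bar E\,\psi_\rho(t)\,\psi_\rho^T(t) \tag{9.41}$$

We can summarize the result as follows:

Under the assumptions of Theorem 8.4 and (9.33)

$$\sqrt{N}\,(\hat\rho_N - \rho_0) \in \mathrm{AsN}(0, P_\rho)$$

where

$$P_\rho = \lambda_0\left[\bar E\,\psi_\rho(t)\psi_\rho^T(t)\right]^{-1}\left[\bar E\,\tilde\psi_\rho(t)\tilde\psi_\rho^T(t)\right]\left[\bar E\,\psi_\rho(t)\psi_\rho^T(t)\right]^{-1} \tag{9.42}$$

The estimates $\hat\rho_N$ and $\hat\eta_N$ are asymptotically uncorrelated.

This result can be extended to more general norms $\ell(\varepsilon)$ analogously to the case $\mathcal{S} \in \mathcal{M}$.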

One might ask what (fixed or estimated) noise model $H(q,\eta^*)$ minimizes $P_\rho$. Not surprisingly, the result is $H(q,\eta^*) = H_0(q)$ so that $F(q) = 1$ in (9.34). See Problem 9G.6.

Example 9.3 Covariance of Output Error Estimate

Consider again the system (9.20), now together with the model structure

$$\mathcal{M}: \quad y(t) = \frac{q^{-1}}{1 + f q^{-1}}\,u(t) + e(t), \qquad \theta = f \tag{9.43}$$

or

$$\hat y(t|\theta) = \frac{q^{-1}}{1 + f q^{-1}}\,u(t)$$

This is an output error model. Clearly, $f = a_0$ will give a correct description of the transfer function, while the noise properties of (9.20) cannot be correctly described within (9.43), since

$$H_0(q) = \frac{1}{1 + a_0 q^{-1}} \quad \text{and} \quad H(q,\theta) = 1$$

We are thus in a situation to which the result (9.42) applies. We find that

$$\psi_\rho(t,\theta) = \frac{d}{df}\,\hat y(t|\theta) = \frac{-q^{-2}}{(1 + f q^{-1})^2}\,u(t) = \frac{-1}{(1 + a_0 q^{-1})^2}\,u(t-2) \tag{9.44}$$

when evaluated at $\theta^* = a_0$. For $F(q)$ in (9.34), we have

$$F(q) = \frac{1/(1 + a_0 q^{-1})}{1} = \sum_{i=0}^{\infty} (-a_0)^i q^{-i} \tag{9.45}$$

Hence

$$\tilde\psi_\rho(t) = \sum_{i=0}^{\infty} (-a_0)^i\,\psi_\rho(t + i) = \left[\sum_{i=0}^{\infty} (-a_0)^i q^{i}\right]\psi_\rho(t) = \frac{1}{1 + a_0 q}\,\psi_\rho(t)$$

The spectra are

$$\Phi_{\psi_\rho}(\omega) = \frac{\mu}{|1 + a_0 e^{i\omega}|^4}, \qquad \Phi_{\tilde\psi_\rho}(\omega) = \frac{\mu}{|1 + a_0 e^{i\omega}|^6}$$

from which the variances can be computed as in Appendix 2C. This gives

$$\bar E\,\psi_\rho^2(t) = \mu\,\frac{1 + a_0^2}{(1 - a_0^2)^3}, \qquad \bar E\,\tilde\psi_\rho^2(t) = \mu\,\frac{1 + 4a_0^2 + a_0^4}{(1 - a_0^2)^5}$$

and thus the variance is, according to (9.42),

$$\operatorname{Cov}\hat f_N \sim \frac{\lambda_0}{N}\cdot\frac{1 - a_0^2}{\mu}\cdot\frac{1 + 4a_0^2 + a_0^4}{(1 + a_0^2)^2} = \frac{\lambda_0}{N}\cdot\frac{1 - a_0^2}{\mu}\left[1 + \frac{2a_0^2}{(1 + a_0^2)^2}\right] \tag{9.46}$$
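The two variances used in Example 9.3 can be cross-checked without contour integration, by summing squared impulse responses. The sketch below assumes a white input with variance $\mu$ and uses the fact that a signal with spectrum $\mu/|1 + a_0 e^{i\omega}|^{2k}$ has the same variance as white noise filtered through $1/(1 + a_0 q^{-1})^k$. The function name `oe_variance_factors`, the truncation length `K`, and the value $a_0 = 0.5$ are illustrative assumptions.

```python
import numpy as np


def oe_variance_factors(a0, mu=1.0, K=400):
    """Return (E psi^2, E psi_tilde^2) via truncated impulse-response sums."""
    h1 = (-a0) ** np.arange(K)            # impulse response of 1/(1 + a0 q^-1)
    h2 = np.convolve(h1, h1)[:K]          # 1/(1 + a0 q^-1)^2  -> spectrum ~ |.|^-4
    h3 = np.convolve(h2, h1)[:K]          # 1/(1 + a0 q^-1)^3  -> spectrum ~ |.|^-6
    return mu * np.sum(h2 ** 2), mu * np.sum(h3 ** 2)


def closed_forms(a0, mu=1.0):
    """Closed-form variances and the factor N*Cov(f_hat)/lambda_0 from (9.46)."""
    x = a0 ** 2
    e_psi2 = mu * (1 + x) / (1 - x) ** 3
    e_psit2 = mu * (1 + 4 * x + x ** 2) / (1 - x) ** 5
    cov_factor = (1 - x) / mu * (1 + 2 * x / (1 + x) ** 2)
    return e_psi2, e_psit2, cov_factor


if __name__ == "__main__":
    a0 = 0.5
    num_psi2, num_psit2 = oe_variance_factors(a0)
    e_psi2, e_psit2, cov_factor = closed_forms(a0)
    print("E psi^2       :", num_psi2, e_psi2)
    print("E psi_tilde^2 :", num_psit2, e_psit2)
    print("N*Cov/lambda0 :", num_psit2 / num_psi2 ** 2, cov_factor)
```

The last printed pair compares the sandwich expression (9.42), evaluated numerically, with the closed form in (9.46).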


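□

Multivariable Case (*)

Consider the case where $\varepsilon(t)$ is a $p$-dimensional vector, and

$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N} \ell(\varepsilon(t,\theta))$$

The formulas (9.9) to (9.14) are still applicable. Assuming $\theta^* = \theta_0$ is such that $\varepsilon(t,\theta_0) = e_0(t)$ is a sequence of independent, zero mean vectors with $E\,\ell_\varepsilon'(e_0(t)) = 0$, then straightforward calculations give

$$P_\theta = \left[\bar E\,\psi(t,\theta_0)\,\Xi\,\psi^T(t,\theta_0)\right]^{-1}\left[\bar E\,\psi(t,\theta_0)\,\Sigma\,\psi^T(t,\theta_0)\right]\left[\bar E\,\psi(t,\theta_0)\,\Xi\,\psi^T(t,\theta_0)\right]^{-1} \tag{9.47}$$

Here $\psi(t,\theta)$ is the $d \times p$ matrix $(d/d\theta)\hat y(t|\theta)$ and $\Xi$ and $\Sigma$ are $p \times p$ matrices:

$$\Xi = E\,\ell_{\varepsilon\varepsilon}''(e_0(t)) \tag{9.48a}$$

$$\Sigma = E\,\ell_\varepsilon'(e_0(t))\left[\ell_\varepsilon'(e_0(t))\right]^T \tag{9.48b}$$

For the quadratic criterion (7.27),

$$\ell(\varepsilon) = \tfrac{1}{2}\,\varepsilon^T\Lambda^{-1}\varepsilon$$

we find that

$$\Xi = \Lambda^{-1}$$

and

$$\Sigma = \Lambda^{-1}\Lambda_0\Lambda^{-1}$$

where $\Lambda_0 = E\,e_0(t)\,e_0^T(t)$.

If the choice of weighting matrix $\Lambda$ in the criterion is $\Lambda = \Lambda_0$ we thus obtain from (9.47)

$$P_\theta = \left[\bar E\,\psi(t,\theta_0)\,\Lambda_0^{-1}\,\psi^T(t,\theta_0)\right]^{-1}$$

See also Problem 9E.4 and the discussion in Section 15.2.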


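9.4 FREQUENCY-DOMAIN EXPRESSIONS FOR THE ASYMPTOTIC VARIANCE


Variance When $\mathcal{S} \in \mathcal{M}$


We found in the previous section that the asymptotic variance is given by (9.17) when $\mathcal{S} \in \mathcal{M}$. This can be expressed in terms of frequency functions if Parseval's relation (or Theorem 2.2) is applied. First we need a convenient expression for $\psi(t, \theta)$. With $T'$ being the gradient of $T = [G \;\; H]$ w.r.t. $\theta$, we have from (4.121) and (4.124)

$$\psi(t, \theta) = \frac{1}{H(q,\theta)}\, T'(q, \theta) \begin{bmatrix} 1 & 0 \\ -H^{-1}(q,\theta) G(q,\theta) & H^{-1}(q,\theta) \end{bmatrix} \begin{bmatrix} u(t) \\ y(t) \end{bmatrix} = \frac{1}{H(q,\theta)}\, T'(q,\theta) \begin{bmatrix} u(t) \\ \varepsilon(t,\theta) \end{bmatrix} \tag{9.50}$$

At $\theta = \theta_0$, using $\varepsilon(t, \theta_0) = e_0(t)$, we thus obtain

$$\psi(t, \theta_0) = H^{-1}(q, \theta_0)\, T'(q, \theta_0)\, \chi_0(t) \tag{9.51}$$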


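where $\chi_0(t)$ is defined by (8.14). Let $\Phi_{\chi_0}(\omega)$ be the spectrum of $\chi_0$ (a $2 \times 2$ matrix):

$$\Phi_{\chi_0}(\omega) = \begin{bmatrix} \Phi_u(\omega) & \Phi_{ue}(\omega) \\ \Phi_{ue}(-\omega) & \lambda_0 \end{bmatrix} \tag{9.52}$$

Here $\Phi_{ue}(\omega)$ is the cross-spectrum between the input and the innovations (which is identically zero when the system operates in open loop).

From Theorem 2.2 and Parseval's relation we thus have

$$\bar E\, \psi(t, \theta_0) \psi^T(t, \theta_0) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{1}{|H(e^{i\omega}, \theta_0)|^2}\, T'(e^{i\omega}, \theta_0)\, \Phi_{\chi_0}(\omega)\, T'^T(e^{-i\omega}, \theta_0)\, d\omega \tag{9.53}$$

Now, using

$$\Phi_v(\omega) = \lambda_0\, |H_0(e^{i\omega})|^2$$

for the noise spectrum, we find that

$$\mathrm{Cov}\, \hat\theta_N \sim \frac{1}{N} \left[ \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{1}{\Phi_v(\omega)}\, T'(e^{i\omega}, \theta_0)\, \Phi_{\chi_0}(\omega)\, T'^T(e^{-i\omega}, \theta_0)\, d\omega \right]^{-1} \tag{9.54}$$

under the assumption that $\mathcal{S} \in \mathcal{M}$ and that a quadratic criterion is used. Recall the definition of $T'$ in (4.125).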


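Variance when $G_0 \in \mathcal{G}$ (*)


The expression (9.42) can be given in the frequency domain by quite analogous calculations. We obtain

$$\mathrm{Cov}\, \hat\rho_N \sim \frac{1}{N}\, R^{-1} Q R^{-1}$$

$$R = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{\Phi_u(\omega)}{|H(e^{i\omega}, \eta^*)|^2}\, G'_\rho(e^{i\omega}, \rho_0)\, G'^T_\rho(e^{-i\omega}, \rho_0)\, d\omega$$

$$Q = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{\Phi_v(\omega)\, \Phi_u(\omega)}{|H(e^{i\omega}, \eta^*)|^4}\, G'_\rho(e^{i\omega}, \rho_0)\, G'^T_\rho(e^{-i\omega}, \rho_0)\, d\omega \tag{9.55}$$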


Variance of the Transfer Functions


So far, we have discussed the covariance matrix of the parameters $\theta$. For linear, black-box models these parameters are only means to describe various input-output properties, like the frequency function, the impulse response, the poles and zeros, etc. It is of interest to translate the covariance matrix of the parameters to the covariance of these properties.

First, let us consider in general how the covariance properties transform. Let the $d$-dimensional vector $\hat\theta$ be an estimate with mean $\theta_0$ and covariance matrix $P$. We are interested in the $p$-dimensional random variable $f(\hat\theta)$. Asymptotically, as $\hat\theta$ becomes sufficiently close to $\theta_0$, we have with good accuracy

$$f(\hat\theta) \approx f(\theta_0) + f'(\theta_0)(\hat\theta - \theta_0)$$

where $f'$ is the $p \times d$ derivative of $f$ with respect to $\theta$. This means that we asymptotically have

$$\begin{aligned} \mathrm{Cov}\, f(\hat\theta) &= E\left(f(\hat\theta) - E f(\hat\theta)\right)\left(f(\hat\theta) - E f(\hat\theta)\right)^T \\ &\approx E\left(f(\hat\theta) - f(\theta_0)\right)\left(f(\hat\theta) - f(\theta_0)\right)^T \\ &\approx f'(\theta_0)\, P\, [f'(\theta_0)]^T \end{aligned} \tag{9.56}$$

This expression is also known as Gauss' approximation formula.
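As an illustration of Gauss' approximation formula (9.56), the following sketch propagates a parameter covariance through a smooth function using a central-difference Jacobian. The helper name `gauss_approx_cov` and the example function (the static gain $b/(1+a)$ of a first-order model) are assumptions made for the illustration and do not appear in the text.

```python
import numpy as np

def gauss_approx_cov(f, theta0, P, eps=1e-6):
    """Approximate Cov f(theta_hat) by f'(theta0) P f'(theta0)^T, cf. (9.56).

    f'(theta0) is the p x d Jacobian, here obtained by central differences.
    """
    theta0 = np.asarray(theta0, dtype=float)
    P = np.asarray(P, dtype=float)
    p = np.atleast_1d(f(theta0)).size
    J = np.empty((p, theta0.size))
    for k in range(theta0.size):
        step = np.zeros_like(theta0)
        step[k] = eps
        J[:, k] = (np.atleast_1d(f(theta0 + step))
                   - np.atleast_1d(f(theta0 - step))) / (2 * eps)
    return J @ P @ J.T

def static_gain(theta):
    """Hypothetical property of interest: gain b / (1 + a) with theta = [a, b]."""
    a, b = theta
    return b / (1 + a)

if __name__ == "__main__":
    P = np.diag([0.01, 0.04])          # assumed parameter covariance
    print(gauss_approx_cov(static_gain, [-0.5, 1.0], P))
```

For this example the Jacobian is $[-b/(1+a)^2,\; 1/(1+a)] = [-4,\; 2]$, so the approximation gives $16 \cdot 0.01 + 4 \cdot 0.04 = 0.32$.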
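Let us now focus on the frequency functions:

$$\hat G_N(e^{i\omega}) = G(e^{i\omega}, \hat\theta_N), \qquad \hat H_N(e^{i\omega}) = H(e^{i\omega}, \hat\theta_N)$$

These are complex-valued functions, and in accordance with (1.13) we define

$$\mathrm{Cov}\, \hat G(e^{i\omega}) = E\left|\hat G(e^{i\omega}) - E \hat G(e^{i\omega})\right|^2 \tag{9.57}$$

and

$$\mathrm{Cov} \begin{bmatrix} \hat G(e^{i\omega}) \\ \hat H(e^{i\omega}) \end{bmatrix} = E \left( \begin{bmatrix} \hat G(e^{i\omega}) \\ \hat H(e^{i\omega}) \end{bmatrix} - E \begin{bmatrix} \hat G(e^{i\omega}) \\ \hat H(e^{i\omega}) \end{bmatrix} \right) \left( \begin{bmatrix} \hat G(e^{-i\omega}) \\ \hat H(e^{-i\omega}) \end{bmatrix} - E \begin{bmatrix} \hat G(e^{-i\omega}) \\ \hat H(e^{-i\omega}) \end{bmatrix} \right)^T \tag{9.58}$$

If the asymptotic covariance matrix of $\hat\theta_N$ is $\frac{1}{N} P_\theta$, then Gauss' approximation formula directly gives

$$\mathrm{Cov} \begin{bmatrix} \hat G_N(e^{i\omega}) \\ \hat H_N(e^{i\omega}) \end{bmatrix} \approx \frac{1}{N} \left[T'(e^{i\omega}, \theta_0)\right]^T P_\theta\, T'(e^{-i\omega}, \theta_0) \tag{9.59}$$

where $T'$ is defined by (4.125). By plugging in the expression for $P_\theta$ from (9.54), a nicely symmetric expression for the variance of the frequency functions is obtained.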


We can be more specific for certain common choices of model structures. Consider an ARX model

$$G(q, \theta) = \frac{B(q)}{A(q)}, \qquad H(q, \theta) = \frac{1}{A(q)}$$

with the same order of the $A$- and $B$-polynomials: $A(q) = 1 + a_1 q^{-1} + \ldots + a_n q^{-n}$, $B(q) = b_1 q^{-1} + \ldots + b_n q^{-n}$. Let $\theta_k = [a_k \;\; b_k]^T$. Then the derivative with respect to $\theta_k$, i.e., the rows $2k - 1$ and $2k$ of $T'(q, \theta)$, will be

$$\frac{\partial}{\partial \theta_k} T(q, \theta) = \begin{bmatrix} -\dfrac{B(q)}{A^2(q)} & -\dfrac{1}{A^2(q)} \\[2mm] \dfrac{1}{A(q)} & 0 \end{bmatrix} q^{-k} = q^{-k} M(q, \theta)$$

where the last step is a definition of $M(q, \theta)$. This means that

$$T'(e^{i\omega}, \theta) = W_n(e^{i\omega})\, M(e^{i\omega}, \theta) \tag{9.60a}$$

$$M(e^{i\omega}, \theta) = \begin{bmatrix} -\dfrac{B(e^{i\omega})}{A^2(e^{i\omega})} & -\dfrac{1}{A^2(e^{i\omega})} \\[2mm] \dfrac{1}{A(e^{i\omega})} & 0 \end{bmatrix} \tag{9.60b}$$

$$W_n(e^{i\omega}) = \begin{bmatrix} e^{-i\omega} I & e^{-2i\omega} I & \cdots & e^{-ni\omega} I \end{bmatrix}^T \tag{9.60c}$$

Here $I$ is the $2 \times 2$ identity matrix. This shows how we can easily build $T'$ for arbitrary orders $n$. It is also straightforward to show that all the black-box models in the family (4.33) obey the same structure, with different definitions of $M$. This matrix will have dimension $s \times 2$, where $s$ is the number of different polynomials used by the structure, i.e., 3 for ARMAX, 2 for OE, and 4 for BJ. Moreover, the identity matrix will then be of dimension $s \times s$.
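The shift structure (9.60) is easy to verify numerically for the ARX case. The sketch below builds $T' = W_n M$ and compares it with a finite-difference derivative of $T = [G \;\; H]$. It assumes the parameter ordering $\theta = [a_1\; b_1\; \ldots\; a_n\; b_n]^T$ used above; the function names are hypothetical.

```python
import numpy as np

def arx_T(theta, w):
    """T(e^{iw}, theta) = [B/A, 1/A] for theta = [a1, b1, ..., an, bn]."""
    a, b = theta[0::2], theta[1::2]
    k = np.arange(1, len(a) + 1)
    z = np.exp(1j * w)
    A = 1 + np.sum(a * z ** (-k))
    B = np.sum(b * z ** (-k))
    return np.array([B / A, 1 / A])

def arx_T_prime(theta, w):
    """T'(e^{iw}, theta) = W_n(e^{iw}) M(e^{iw}, theta), cf. (9.60a)-(9.60c)."""
    a, b = theta[0::2], theta[1::2]
    n = len(a)
    k = np.arange(1, n + 1)
    z = np.exp(1j * w)
    A = 1 + np.sum(a * z ** (-k))
    B = np.sum(b * z ** (-k))
    M = np.array([[-B / A**2, -1 / A**2],
                  [1 / A, 0]])                                   # (9.60b)
    W = np.vstack([z ** (-kk) * np.eye(2) for kk in k])           # (9.60c), 2n x 2
    return W @ M

def numerical_T_prime(theta, w, eps=1e-6):
    """Central-difference derivative of T with respect to each parameter."""
    rows = []
    for d in eps * np.eye(len(theta)):
        rows.append((arx_T(theta + d, w) - arx_T(theta - d, w)) / (2 * eps))
    return np.array(rows)

if __name__ == "__main__":
    theta = np.array([-1.5, 1.0, 0.7, 0.5])    # example second-order ARX parameters
    print(np.max(np.abs(arx_T_prime(theta, 1.0) - numerical_T_prime(theta, 1.0))))
```

The printed maximum deviation should be at the level of finite-difference error, which supports reading rows $2k-1$ and $2k$ of $T'$ as $q^{-k} M$.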


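Asymptotic Black-Box Theory


The expressions (9.60), (9.59), and (9.54) together constitute an explicit way of computing the frequency function covariances. It is not easy, though, to get an intuitive feeling for the result. It turns out, however, that as the model order increases, $n \to \infty$, the expressions simplify drastically. The simplification is based on the following result: Let $W_n$ be defined by (9.60c) with the identity matrix of size $s$, and let $L(\omega)$ be an $s \times s$ invertible matrix function, with a well defined inverse Fourier transform. Then

$$\lim_{n \to \infty} \frac{1}{n}\, W_n^T(e^{i\omega_1}) \left[ \frac{1}{2\pi} \int_{-\pi}^{\pi} W_n(e^{-i\xi})\, L(\xi)\, W_n^T(e^{i\xi})\, d\xi \right]^{-1} W_n(e^{-i\omega_2}) = [L(\omega_1)]^{-1}\, \delta_{\omega_1, \omega_2} \tag{9.61}$$

where $\delta_{\omega_1,\omega_2}$ is 1 if the subscripts coincide and 0 otherwise. This result is proved as Lemma 4.3 in Yuan and Ljung (1985). See also Lemma 4.2 in Ljung and Yuan (1985).

We can now combine (9.59) with (9.60) and (9.54):

$$\begin{aligned} N \cdot \mathrm{Cov} \begin{bmatrix} \hat G_N(e^{i\omega}) \\ \hat H_N(e^{i\omega}) \end{bmatrix} &= M^T(e^{i\omega}, \theta_0)\, W_n^T(e^{i\omega}) \\ &\quad \times \left[ \frac{1}{2\pi} \int_{-\pi}^{\pi} W_n(e^{-i\xi})\, M(e^{-i\xi}, \theta_0)\, \frac{\Phi_{\chi_0}(-\xi)}{\Phi_v(\xi)}\, M^T(e^{i\xi}, \theta_0)\, W_n^T(e^{i\xi})\, d\xi \right]^{-1} \\ &\quad \times W_n(e^{-i\omega})\, M(e^{-i\omega}, \theta_0) \\ &\approx n \cdot M^T(e^{i\omega}, \theta_0) \left[ M(e^{-i\omega}, \theta_0)\, \frac{\Phi_{\chi_0}(-\omega)}{\Phi_v(\omega)}\, M^T(e^{i\omega}, \theta_0) \right]^{-1} M(e^{-i\omega}, \theta_0) \\ &= n\, \Phi_v(\omega) \left[\Phi_{\chi_0}(-\omega)\right]^{-1} \end{aligned}$$

where the approximate equality follows from (9.61). We complex-conjugated (9.54) first, which is allowed since the parameter covariance matrix is real. In summary we have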


$$\mathrm{Cov} \begin{bmatrix} \hat G_N(e^{i\omega}) \\ \hat H_N(e^{i\omega}) \end{bmatrix} \approx \frac{n}{N}\, \Phi_v(\omega) \begin{bmatrix} \Phi_u(\omega) & \Phi_{ue}(-\omega) \\ \Phi_{ue}(\omega) & \lambda_0 \end{bmatrix}^{-1} \tag{9.62}$$

where $n$ is the model order and $N$ is the number of data, and where the approximation improves as the model order increases. Note that (9.61) also implies that estimates at different frequencies are uncorrelated. We showed the result here for the ARX model, but the only requirement on the model structure is the shift property (9.60). This holds for many black-box parameterizations, including the family (4.33). The result (9.62) is thus fundamental. However, note that it is asymptotic in the model order $n$, and corresponds to the case where the model structure is so flexible that the estimates at different frequencies become decoupled.

In open loop operation, where $\Phi_{ue} \equiv 0$, we find that the estimates $\hat G_N$ and $\hat H_N$ are asymptotically uncorrelated, even when they are parameterized with common parameters. Then

$$\mathrm{Cov}\, \hat G_N(e^{i\omega}) \approx \frac{n}{N}\, \frac{\Phi_v(\omega)}{\Phi_u(\omega)} \tag{9.63a}$$

$$\mathrm{Cov}\, \hat H_N(e^{i\omega}) \approx \frac{n}{N}\, \frac{\Phi_v(\omega)}{\lambda_0} = \frac{n}{N}\, |H_0(e^{i\omega})|^2 \tag{9.63b}$$

In Part III we shall find (9.62) to be of considerable use for many issues in experiment design. It is therefore of interest to check how well (9.62) describes the exact expression for moderate values of $n$.
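To make the asymptotic expression concrete, the sketch below evaluates the $2 \times 2$ matrix in (9.62) at a single frequency from given spectral values. The function name and argument names are hypothetical, and the code assumes real-valued signals, so that $\Phi_{ue}(-\omega)$ is the complex conjugate of $\Phi_{ue}(\omega)$.

```python
import numpy as np

def asymptotic_tf_cov(n, N, phi_v, phi_u, lam0, phi_ue=0.0):
    """Evaluate (n/N) Phi_v [[Phi_u, Phi_ue(-w)], [Phi_ue(w), lam0]]^{-1}, cf. (9.62).

    All spectra are values at one frequency w. For real signals
    Phi_ue(-w) = conj(Phi_ue(w)), which is assumed here.
    """
    S = np.array([[phi_u, np.conj(phi_ue)],
                  [phi_ue, lam0]], dtype=complex)
    return n / N * phi_v * np.linalg.inv(S)

if __name__ == "__main__":
    # Open loop (Phi_ue = 0): the matrix is diagonal and reduces to (9.63a)-(9.63b)
    C = asymptotic_tf_cov(n=10, N=1000, phi_v=2.0, phi_u=4.0, lam0=0.5)
    print(C.real)
```

In the open-loop example the diagonal entries are $(n/N)\Phi_v/\Phi_u = 0.005$ and $(n/N)\Phi_v/\lambda_0 = 0.04$, and the off-diagonal entries vanish.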


Example 9.4 Comparing (9.62) with Exact Expressions for Finite $n$

Consider a second-order system

$$y(t) + a_1^0 y(t-1) + a_2^0 y(t-2) = b_1^0 u(t-1) + b_2^0 u(t-2) + e_0(t) \tag{9.64}$$

We assume that this system is identified using the least-squares method in a model structure given by

$$y(t) + a_1 y(t-1) + \cdots + a_n y(t-n) = b_1 u(t-1) + \cdots + b_n u(t-n) + e(t) \tag{9.65}$$

The input $\{u(t)\}$ is taken as white noise with variance 1, and the disturbance $e_0(t)$ to the system (9.64) is also supposed to be white noise with unit variance.

Let $\hat G_N(e^{i\omega}, n)$ be the estimate of the transfer function obtained with model order $n$. Since the true system (9.64) is of second order, we can use (9.54), (9.59), and (9.60) to evaluate $P_n(\omega)$ in

$$\mathrm{Cov}\, \hat G_N(e^{i\omega}, n) \sim \frac{1}{N}\, P_n(\omega)$$

for $n \geq 2$. This gives an explicit, but complicated expression for $P_n(\omega)$, which is exact in $n$ (but asymptotic in $N$). According to (9.63a) we have, asymptotically in $n$,

$$\mathrm{Cov}\, \hat G_N(e^{i\omega}, n) \sim \frac{n}{N}\, P_\infty(\omega) = \frac{n}{N}\, \frac{\Phi_v(\omega)}{\Phi_u(\omega)} = \frac{n/N}{\left|1 + a_1^0 e^{i\omega} + a_2^0 e^{2i\omega}\right|^2} \tag{9.66}$$

We shall thus compare how well $n \cdot P_\infty(\omega)$ approximates $P_n$. We do this by plotting $P_\infty(\omega)$ and $(1/n) P_n(\omega)$ as functions of $\omega$. First consider the "Åström system,"


$$a_1^0 = -1.5, \qquad a_2^0 = 0.7, \qquad b_1^0 = 1.0, \qquad b_2^0 = 0.5 \tag{9.67}$$

In Figure 9.1(a) we plot $\log (1/n) P_n(\omega)$ for $n = 2, 10$, and $\log P_\infty(\omega)$ against $\log \omega$.
The figure illustrates well the convergence to the limit as $n$ increases. The convergence could be said to be rather slow. Of greater importance, however, is that, even for small $n$, the limit expression gives a reasonable feel for how the exact variance changes with frequency. To test this feature, we evaluate the expression for a number of second-order systems with different characteristics. The results are shown in Figures 9.1(b) to 9.1(d).

According to the asymptotic expression (9.62), the variance should not depend on the number of estimated parameters, but only on the model order. This is a slightly surprising result. To test whether this result also has relevance for low-order models, we identified the system (9.64) and (9.67) also in the ARMAX structure,

$$y(t) + a_1 y(t-1) + a_2 y(t-2) = b_1 u(t-1) + b_2 u(t-2) + e(t) + c_1 e(t-1) + c_2 e(t-2) \tag{9.68}$$

which employs 50% more parameters than (9.65) with the same model order. In Figure 9.1(e) we show the variance of the transfer function when estimated in the model structures (9.65) and (9.68), respectively, with $n = 2$. The agreement is striking.

Finally, we identified the system (9.64) and (9.67) in the ARMAX model structure (9.68), with a low-frequency input with spectrum
D, (Œw) = sae (9.69) 

1.25 — cosw 
The results are shown in Figure 9.1(f). Comparing with Figure 9.1(a), we see that the agreement between the asymptotic and exact expressions is now worse, especially at low frequencies.


We may conclude from the example that the asymptotic variance expressions give a good feel for the character of the true variance, but must be used with care for quantitative calculations.


9.5 THE CORRELATION APPROACH 


Basic Theorem 


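The correlation estimate $\hat\theta_N$ is defined by (7.110). We shall confine ourselves to the case studied in Theorem 8.6, that is, $\alpha(\varepsilon) = \varepsilon$ and linearly generated instruments. We thus have

$$\hat\theta_N = \operatorname*{sol}_{\theta \in D_\mathcal{M}} \left[ f_N(\theta, Z^N) = 0 \right] \tag{9.70a}$$

$$f_N(\theta, Z^N) = \frac{1}{N} \sum_{t=1}^{N} \zeta(t, \theta)\, \varepsilon_F(t, \theta) \tag{9.70b}$$

$$\varepsilon_F(t, \theta) = L(q)\, \varepsilon(t, \theta) \tag{9.70c}$$

$$\zeta(t, \theta) = K_y(q, \theta)\, y(t) + K_u(q, \theta)\, u(t) \tag{9.70d}$$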


Figure 9.1 Plots of $\log (1/n) P_n(\omega)$ and of $\log P_\infty(\omega)$ versus $\omega$ for different systems, model structures, and inputs. Thick line: the asymptotic expression $P_\infty$. Thin lines: normalized true variance for $n = 2$ (solid) and $n = 10$ (dashed).

(a) System: (9.64), (9.67); model: (9.65); input: white noise.

(b) System: (9.64), $a_1^0 = -0.8$, $a_2^0 = 0.2$, $b_1^0 = 1$, $b_2^0 = -0.9$; model: (9.65); input: white noise.

(c) System: (9.64), $a_1^0 = -1.8$, $a_2^0 = 0.81$, $b_1^0 = 1$, $b_2^0 = 0$; model: (9.65); input: white noise.

(d) System: (9.64), $a_1^0 = -1.4$, $a_2^0 = 0.98$, $b_1^0 = 1$, $b_2^0 = 0.5$; model: (9.65); input: white noise.

(e) System: (9.64), (9.67). Thick line: model (9.65). Thin line: model (9.68). $n = 2$. Input: white noise.

(f) System: (9.64), (9.67); model: (9.68); input: (9.69).

By Taylor's expansion, we obtain

$$0 = f_N(\hat\theta_N, Z^N) = f_N(\theta^*, Z^N) + f'_N(\xi_N, Z^N)\,(\hat\theta_N - \theta^*) \tag{9.71}$$

This is entirely analogous to (9.3), with the difference that $\psi(t, \theta^*)$ in $V'_N(\theta^*, Z^N)$ is replaced by $\zeta(t, \theta^*)$ in $f_N(\theta^*, Z^N)$. The analysis of (9.71) is therefore essentially identical to the case already studied in Section 9.2, and the result can be formulated as follows.


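Theorem 9.2. Consider $\hat\theta_N$ determined by (9.70). Assume that $\varepsilon(t, \theta)$ is computed from a linear, uniformly stable model structure, and that

$$\left\{ K_y(q, \theta),\; K_u(q, \theta),\; \frac{d}{d\theta} K_y(q, \theta),\; \frac{d}{d\theta} K_u(q, \theta);\; \theta \in D_\mathcal{M} \right\}$$

is a uniformly stable family of filters. Assume that the data set $Z^\infty$ is subject to D1 (see Section 8.2). Assume also that $\hat\theta_N \to \theta^*$ w.p. 1 as $N \to \infty$, that $\bar f'(\theta^*)$ is nonsingular [$\bar f$ defined by (8.87)], and that

$$\sqrt{N}\, E f_N(\theta^*, Z^N) \to 0, \quad \text{as } N \to \infty$$

Then

$$\sqrt{N}\,(\hat\theta_N - \theta^*) \in AsN(0, P_\theta) \tag{9.72}$$

with

$$P_\theta = [\bar f'(\theta^*)]^{-1}\, Q\, [\bar f'(\theta^*)]^{-T} \tag{9.73}$$

$$Q = \lim_{N \to \infty} N \cdot E f_N(\theta^*, Z^N)\, f_N^T(\theta^*, Z^N) \tag{9.74}$$

We find from (9.70) that

$$\bar f'(\theta^*) = \bar E\, \zeta'_\theta(t, \theta^*)\, \varepsilon_F(t, \theta^*) - \bar E\, \zeta(t, \theta^*) \left[ L(q)\, \psi(t, \theta^*) \right]^T \tag{9.75}$$

where $\psi$ as before denotes the negative gradient of $\varepsilon$ with respect to $\theta$.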


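Variance Expression when $\mathcal{S} \in \mathcal{M}$, $L(q) = 1$ (*)


Under the assumption $\mathcal{S} \in \mathcal{M}$, there exists a value $\theta_0$ such that $\varepsilon(t, \theta_0) = e_0(t)$ is a sequence of independent random variables with zero mean values and variances $\lambda_0$. We thus obtain, for $L(q) = 1$,

$$\bar f'(\theta_0) = -\bar E\, \zeta(t, \theta_0)\, \psi^T(t, \theta_0) + \bar E\, \zeta'_\theta(t, \theta_0)\, e_0(t) = -\bar E\, \zeta(t, \theta_0)\, \psi^T(t, \theta_0) \tag{9.76}$$

and

$$Q = \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} \sum_{s=1}^{N} E\, \zeta(t, \theta_0)\, e_0(t)\, e_0(s)\, \zeta^T(s, \theta_0) = \lambda_0\, \bar E\, \zeta(t, \theta_0)\, \zeta^T(t, \theta_0) \tag{9.77}$$

so

$$P_\theta = \lambda_0 \left[ \bar E\, \zeta(t, \theta_0)\, \psi^T(t, \theta_0) \right]^{-1} \left[ \bar E\, \zeta(t, \theta_0)\, \zeta^T(t, \theta_0) \right] \left[ \bar E\, \psi(t, \theta_0)\, \zeta^T(t, \theta_0) \right]^{-1} \tag{9.78}$$

See also Problem 9G.4.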


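Example 9.5 Covariance of Pseudolinear Regression Estimate

Consider again the system and the model of Example 9.2, but suppose that the $c$-estimate is determined by the PLR method (7.113); that is,

$$\hat c_N^{PLR} = \operatorname{sol} \left[ \frac{1}{N} \sum_{t=1}^{N} \varepsilon(t-1, c)\, \varepsilon(t, c) = 0 \right]$$

Here $\zeta(t, \theta) = \varphi(t, \theta) = \varepsilon(t-1, c) = y(t-1) - \hat y(t-1|c)$, $\theta = c$. At $c = c_0$ we have [see (9.27)]

$$\zeta(t, \theta_0) = e_0(t-1), \qquad \psi(t, \theta_0) = \frac{1}{1 + c_0 q^{-1}}\, e_0(t-1)$$

Hence

$$\bar E\, \zeta(t, \theta_0)\, \zeta^T(t, \theta_0) = \lambda_0$$

$$\bar E\, \zeta(t, \theta_0)\, \psi^T(t, \theta_0) = \lambda_0$$

so, according to (9.78),

$$\mathrm{Cov}\, \hat c_N^{PLR} \sim \frac{1}{N}$$

Compare with (9.28). Note that $|c_0| < 1$ always.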


Instrumental-variable Methods: $G_0 \in \mathcal{G}$


It is of interest to specialize Theorem 9.2 to the instrumental-variable case of Section 7.6. We then have the model

$$\hat y(t|\theta) = \varphi^T(t)\, \theta \tag{9.79}$$

and the procedure (7.129). Suppose now that the true system is given as in (8.92) to (8.94) by

$$y(t) = \varphi^T(t)\, \theta_0 + H_0(q) A_0(q)\, e_0(t) \tag{9.80}$$

where $\{e_0(t)\}$ is white noise with variance $\lambda_0$, independent of $\{u(t)\}$. Then

$$\varepsilon_F(t, \theta_0) = L(q) H_0(q) A_0(q)\, e_0(t)$$

is independent of $\{u(t)\}$ and hence of $\{\zeta(t, \theta_0)\}$ if the system operates in open loop. Thus $\theta_0$ is a solution to

$$\bar E\, \zeta(t, \theta)\, \varepsilon_F(t, \theta) = 0$$

as we also found in (8.95). To get an asymptotic distribution, we shall assume it is the only solution in $D_\mathcal{M}$:

$$\bar E\, \zeta(t, \theta)\, \varepsilon_F(t, \theta) = 0 \;\text{ and }\; \theta \in D_\mathcal{M} \;\Rightarrow\; \theta = \theta_0 \tag{9.81}$$

Introduce also the monic filter

$$F(q) = L(q) H_0(q) A_0(q) = \sum_{i=0}^{\infty} f_i q^{-i} \tag{9.82}$$

Inserting these expressions into (9.72) to (9.74) gives, by entirely analogous calculation as in (9.42), the following result:

For the IV estimate $\hat\theta_N$ we have, under assumptions (9.80) and (9.81), that

$$P_\theta = \lambda_0 \left[ \bar E\, \zeta(t, \theta_0)\, \varphi_F^T(t) \right]^{-1} \left[ \bar E\, \zeta_F(t, \theta_0)\, \zeta_F^T(t, \theta_0) \right] \left[ \bar E\, \varphi_F(t)\, \zeta^T(t, \theta_0) \right]^{-1} \tag{9.83}$$

with

$$\varphi_F(t) = L(q)\, \varphi(t)$$

$$\zeta_F(t, \theta_0) = \sum_{i=0}^{\infty} f_i\, \zeta(t+i, \theta_0)$$

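Example 9.6 Covariance of an IV Estimate

Consider again the system (9.20) of Example 9.1. Let the model be the linear regression (9.21) and let $a$ be estimated by the IV method using

$$\zeta(t) = -\frac{q^{-1}}{1 + a_* q^{-1}}\, u(t-1)$$

as the instrument and $L(q) = 1$. To evaluate (9.83), we find by comparing (9.80) and (9.20) that

$$F(q) = H_0(q) A_0(q) = 1.$$

Hence

$$\varphi_F(t) = -y(t-1) = -\frac{q^{-1}}{1 + a_0 q^{-1}}\, u(t-1) - \frac{1}{1 + a_0 q^{-1}}\, e_0(t-1)$$

$$\zeta_F(t) = \zeta(t)$$

and

$$\bar E\, \zeta(t)\, \varphi_F(t) = \frac{\mu}{2\pi} \int_{-\pi}^{\pi} \frac{1}{1 + a_* e^{i\omega}} \cdot \frac{1}{1 + a_0 e^{-i\omega}}\, d\omega = \frac{\mu}{1 - a_0 a_*}$$

$$\bar E\, \zeta^2(t) = \frac{\mu}{2\pi} \int_{-\pi}^{\pi} \frac{1}{|1 + a_* e^{i\omega}|^2}\, d\omega = \frac{\mu}{1 - a_*^2}$$

Hence

$$\mathrm{Cov}\, \hat a_N \sim \frac{1}{N}\, \frac{\lambda_0}{\mu}\, \frac{(1 - a_0 a_*)^2}{1 - a_*^2} \tag{9.84}$$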
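The two spectral integrals behind (9.84) can be checked by numerical quadrature. The following is a minimal sketch with hypothetical function names; it assumes a white input of variance `mu` and uses the fact that averaging a periodic integrand over a uniform grid on $[-\pi, \pi)$ approximates $\frac{1}{2\pi}\int_{-\pi}^{\pi}$.

```python
import numpy as np

def n_cov_iv_numeric(a0, a_star, mu=1.0, lam0=1.0, n_grid=4096):
    """N * Cov(a_hat) for the IV estimate, from numerically evaluated integrals."""
    w = -np.pi + 2 * np.pi * np.arange(n_grid) / n_grid
    z = np.exp(1j * w)
    # E zeta(t) phi_F(t): cross term between instrument and noise-free regressor
    e_zeta_phi = np.mean(mu / ((1 + a_star * z) * (1 + a0 / z))).real
    # E zeta^2(t): variance of the instrument
    e_zeta2 = np.mean(mu / np.abs(1 + a_star * z) ** 2)
    return lam0 * e_zeta2 / e_zeta_phi ** 2

def n_cov_iv_closed_form(a0, a_star, mu=1.0, lam0=1.0):
    """Closed form (9.84), multiplied by N."""
    return lam0 / mu * (1 - a0 * a_star) ** 2 / (1 - a_star ** 2)

if __name__ == "__main__":
    print(n_cov_iv_numeric(0.5, 0.3), n_cov_iv_closed_form(0.5, 0.3))
```

Both expressions require $|a_0| < 1$ and $|a_*| < 1$. Setting `a_star = a0` reduces (9.84) to $(\lambda_0/\mu)(1 - a_0^2)$.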


Frequency-domain Expressions for the Variance (*)


Frequency-domain expressions for the covariance matrices in (9.78) and (9.83) can be developed along the same lines as for (9.54) and (9.55). For example, for (9.83) we find that

$$\mathrm{Cov}\, \hat\theta_N \sim \frac{1}{N}\, R^{-1} Q R^{-T}$$

$$R = \frac{1}{2\pi} \int_{-\pi}^{\pi} K_u(e^{i\omega})\, K_0^T(e^{-i\omega})\, L(e^{-i\omega})\, \Phi_u(\omega)\, d\omega \tag{9.85}$$

$$Q = \frac{1}{2\pi} \int_{-\pi}^{\pi} K_u(e^{i\omega})\, K_u^T(e^{-i\omega})\, |F(e^{i\omega})|^2\, \lambda_0\, \Phi_u(\omega)\, d\omega$$

Here $K_u$ and $F$ are given by (7.128) and (9.82), respectively, while

$$K_0(q) = \begin{bmatrix} -\dfrac{B_0(q)}{A_0(q)}\, q^{-1} & \cdots & -\dfrac{B_0(q)}{A_0(q)}\, q^{-n_a} & q^{-1} & \cdots & q^{-n_b} \end{bmatrix}^T$$

Note that

$$\varphi(t) = K_0(q)\, u(t) + e_0\text{-dependent terms}$$

An "asymptotic black-box analysis" as in (9.62) can be developed also for IV and PLR estimates. The result is that, for open-loop operation, (9.63) still holds. See Ljung (1985c).


9.6 USE AND RELEVANCE OF ASYMPTOTIC VARIANCE EXPRESSIONS 


In this chapter we have developed a number of expressions for the asymptotic covariance of the estimated quantities. Now, what are these good for? There are two basic uses for such results. One is to use the covariance matrix $P_\theta$ (or some derived quantity) as a quality measure for resulting estimates. Our expressions can then be used for analytical or numerical studies of how various design variables affect the identification result. We shall make extensive use of this approach in Part III when discussing the user's choices.

The other application of the asymptotic covariance results is to compute confidence intervals to assess the reliability of particular estimates obtained from an observed data set. We shall comment on this application shortly.

The derived expressions are asymptotic in $N$, the number of observed data. Our theory has not told us how large $N$ has to be for the results to be applicable. Clearly, this is an important question in order to evaluate the relevance of the covariance expression.


Confidence Intervals 


If the vector $\hat\theta_N$ obeys

$$\sqrt{N}\,(\hat\theta_N - \theta_0) \in AsN(0, P_\theta) \tag{9.86}$$

then the $k$th component obeys

$$\sqrt{N}\,(\hat\theta_N^{(k)} - \theta_0^{(k)}) \in AsN(0, P_\theta^{(kk)}) \tag{9.87}$$

$P_\theta^{(kk)}$ being the $k, k$ diagonal element of $P_\theta$. This means that the probability distribution of the random variable $\sqrt{N}\,(\hat\theta_N^{(k)} - \theta_0^{(k)})$ converges to the Gaussian distribution, and we can use the limit to evaluate, for example, the probability

$$P\left( |\hat\theta_N^{(k)} - \theta_0^{(k)}| > \alpha \right) \approx \sqrt{\frac{N}{2\pi P_\theta^{(kk)}}} \int_{|x| > \alpha} e^{-x^2 N / (2 P_\theta^{(kk)})}\, dx \tag{9.88}$$

In this way we may form confidence intervals for the true parameter; that is, we can tell, with a certain degree of confidence, that the true value $\theta_0^{(k)}$ is to be found in a certain interval around the estimate (provided the underlying assumptions are satisfied, that is).

The vector expression (9.86) tells us more. If a random vector $\eta$ has a Gaussian distribution

$$\eta \in N(0, P)$$

then the scalar

$$z = \eta^T P^{-1} \eta$$

has a $\chi^2$ distribution with $\dim \eta = d$ degrees of freedom,

$$z \in \chi^2(d).$$

From (9.86) we thus draw the conclusion that

$$\eta_N = N \cdot (\hat\theta_N - \theta_0)^T P_\theta^{-1} (\hat\theta_N - \theta_0) \in As\chi^2(d) \tag{9.89}$$

That is, $\eta_N$ converges in distribution to the $\chi^2(d)$ distribution as $N$ tends to infinity. Using $\chi^2$ tables, we can thus derive confidence intervals for $\eta_N$, which correspond to confidence ellipsoids for $\hat\theta_N$ (see also Section II.2).
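The interval and ellipsoid computations can be sketched as follows. The function names are hypothetical and $P_\theta$ is assumed known. To avoid depending on $\chi^2$ tables or an external statistics package, the ellipsoid test is restricted to $d = 2$, where the $\chi^2(2)$ distribution function is $1 - e^{-z/2}$.

```python
import math
from statistics import NormalDist

import numpy as np

def component_interval(theta_hat_k, P_kk, N, level=0.95):
    """Two-sided interval for one component, based on (9.87)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * math.sqrt(P_kk / N)
    return theta_hat_k - half_width, theta_hat_k + half_width

def tail_probability(alpha, P_kk, N):
    """P(|theta_hat_k - theta_0k| > alpha) from the Gaussian limit, cf. (9.88)."""
    sigma = math.sqrt(P_kk / N)
    return 2 * (1 - NormalDist(0.0, sigma).cdf(alpha))

def eta_statistic(theta_hat, theta0, P, N):
    """eta_N = N (theta_hat - theta0)^T P^{-1} (theta_hat - theta0), cf. (9.89)."""
    d = np.asarray(theta_hat, float) - np.asarray(theta0, float)
    return float(N * d @ np.linalg.solve(np.asarray(P, float), d))

def inside_ellipsoid_2d(theta_hat, theta0, P, N, level=0.95):
    """Confidence-ellipsoid test for d = 2 only: chi2(2) quantile is -2 ln(1 - level)."""
    return eta_statistic(theta_hat, theta0, P, N) <= -2.0 * math.log(1.0 - level)

if __name__ == "__main__":
    print(component_interval(0.7, 1.0, 400))
    print(inside_ellipsoid_2d([0.1, 0.0], [0.0, 0.0], np.eye(2), 100))
```

In practice $P_\theta$ is replaced by an estimate, as discussed under "Estimating the Covariance Matrix"; for $d > 2$ the threshold would come from a $\chi^2(d)$ table.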


Estimating the Covariance Matrix 


The expressions (9.86) to (9.89) require the knowledge of $P_\theta$ to be used in practice. Typical expressions for $P_\theta$, assuming $\mathcal{S} \in \mathcal{M}$, such as (9.17), (9.42), (9.78), and (9.83), involve both $\theta_0$, $\lambda_0$, and the symbol $\bar E$, and as such are unknown to the user. Natural estimates $\hat P_N$ of $P_\theta$ are, however, straightforward, replacing $\theta_0$ by $\hat\theta_N$ and $\bar E$ by the sample sum, as in (9.18) and (9.19). Under weak conditions, Theorem 2.3 will then imply that $\hat P_N$ converges w.p. 1 to $P_\theta$, so it is still reasonable to apply the asymptotic results with $\hat P_N$ replacing $P_\theta$. (If $\hat\theta_N$ has an exact Gaussian distribution for finite $N$, more sophisticated calculations, involving $t$- and $F$-distributions, can be carried out to find more accurate, nonasymptotic confidence intervals. See Section II.2.)

It could be noted that it is more cumbersome to find estimates of $P_\theta$ in the more general case when $\mathcal{S} \notin \mathcal{M}$. Replacing $\theta^*$ by $\hat\theta_N$ in (9.9) gives, trivially, $Q = 0$, which is a useless estimate. Hjalmarsson and Ljung (1992) have shown how to circumvent this problem, by devising a consistent estimator of $Q$ also in this case. For the output error case, note that the expressions (9.42) and (9.83) contain the noise filter $H_0(q)$, which typically would not be known to the user.
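As a sketch of the plug-in estimate, consider a first-order ARX model fitted by least squares, where the predictor gradient $\psi(t)$ equals the regression vector. The code below is an illustration under stated assumptions, not the book's procedure: it assumes that (9.18) and (9.19) have the sample-sum form $\hat P_N = \hat\lambda_N [\frac{1}{N}\sum \psi\psi^T]^{-1}$ and $\hat\lambda_N = \frac{1}{N}\sum \varepsilon^2(t, \hat\theta_N)$, and the function name and simulated system are hypothetical.

```python
import numpy as np

def ls_with_covariance(y, u):
    """Least squares for y(t) + a y(t-1) = b u(t-1) + e(t).

    Returns theta_hat = [a, b], lambda_hat, and the estimated Cov(theta_hat) = P_N / N.
    """
    Phi = np.column_stack([-y[:-1], u[:-1]])       # psi(t) = phi(t) for linear regression
    Y = y[1:]
    N = len(Y)
    theta_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)
    eps = Y - Phi @ theta_hat                      # prediction errors at theta_hat
    lam_hat = np.mean(eps ** 2)                    # sample innovation variance
    P_hat = lam_hat * np.linalg.inv(Phi.T @ Phi / N)
    return theta_hat, lam_hat, P_hat / N

# Simulated data (assumed true system: a0 = -0.7, b0 = 1, unit-variance white u and e0)
rng = np.random.default_rng(0)
n_samples = 2000
u = rng.standard_normal(n_samples)
e = rng.standard_normal(n_samples)
y = np.zeros(n_samples)
for t in range(1, n_samples):
    y[t] = 0.7 * y[t - 1] + u[t - 1] + e[t]

theta_hat, lam_hat, cov_hat = ls_with_covariance(y, u)
print(theta_hat, lam_hat, np.sqrt(np.diag(cov_hat)))
```

For this system the asymptotic standard deviation of $\hat a_N$ is $\sqrt{(1 - a_0^2)/((\mu + \lambda_0) N)}$, about 0.011 for $N = 2000$, which the estimated value should roughly match.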


Relevance of the Asymptotic Expressions for Finite $N$


From the asymptotic expressions we know the behavior of $\sqrt{N}\,(\hat\theta_N - \theta_0)$ for "large" $N$. This information is, as we saw, crucial for the confidence we develop in the model. The question remains how large $N$ has to be for the asymptotic expressions to be reliable. No general answer can be given to this question. Monte-Carlo studies have been performed, in which systems have been simulated many times with different noise realizations. The statistics over the estimates so obtained have been compared with the estimated covariance matrices, (9.18)-(9.19). Such studies show that for typical system identification applications, the asymptotic variance expressions are reliable within 10% or so for $N = 300$. Such Monte-Carlo tests depend, however, also on the properties of the random generators used. Despite the slight vagueness of these statements, the conclusion is that the estimated asymptotic variance expression gives a reliable picture of the actual variability and uncertainty of the estimates.

An alternative picture of the estimate uncertainty, which does not rely upon analytical, asymptotic expressions, is obtained using bootstrap techniques. The bootstrap principle starts from the Monte-Carlo idea just described: If we know how our data have been generated, we may regenerate them many times and compute the statistics over the resulting estimates, to determine their reliability. Now, in practice, we do not know the mechanism behind the data generation. Instead, the only information is the data set $Z^N$ itself. The bootstrap idea is now to generate new data sets that resemble $Z^N$ in the sense that they are drawn from the empirical distribution that the observed data $Z^N$ defines. Monte-Carlo studies over such a collection of data sets will now yield confidence intervals and covariance properties. See, e.g., Efron and Tibshirani (1993) for a thorough treatment and Zoubir and Boashash (1998) and Politis (1998) for tutorials focused on signal processing applications.
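One common way to realize the bootstrap idea for dynamic models is a residual bootstrap: fit a model, resample its residuals with replacement, regenerate data through the fitted model, and re-estimate. The sketch below does this for a first-order AR model. It is only one possible variant, chosen here for illustration rather than taken from the text, and all names are hypothetical.

```python
import numpy as np

def fit_ar1(y):
    """Least-squares estimate of a in y(t) + a y(t-1) = e(t)."""
    return -np.dot(y[1:], y[:-1]) / np.dot(y[:-1], y[:-1])

def residual_bootstrap_ar1(y, n_boot=200, seed=0):
    """Return (a_hat, bootstrap standard deviation of a_hat)."""
    rng = np.random.default_rng(seed)
    a_hat = fit_ar1(y)
    resid = y[1:] + a_hat * y[:-1]
    resid = resid - resid.mean()                   # center the empirical residuals
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        e_b = rng.choice(resid, size=len(y), replace=True)
        y_b = np.empty(len(y))
        y_b[0] = y[0]
        for t in range(1, len(y)):
            y_b[t] = -a_hat * y_b[t - 1] + e_b[t]  # regenerate data via fitted model
        estimates[b] = fit_ar1(y_b)
    return a_hat, estimates.std(ddof=1)

# Simulated data (assumed true system: a0 = -0.6, unit-variance white noise)
rng = np.random.default_rng(1)
e = rng.standard_normal(500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.6 * y[t - 1] + e[t]

a_hat, a_std = residual_bootstrap_ar1(y)
print(a_hat, a_std)
```

For comparison, the asymptotic expression gives a standard deviation of $\sqrt{(1 - a_0^2)/N} \approx 0.036$ here; the bootstrap value should be of the same order.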


9.7 SUMMARY 


In this chapter we have shown that the parameter estimates obtained by the prediction error or the correlation approach are asymptotically normally distributed as the number of observed data tends to infinity. This was established in Theorems 9.1 and 9.2. Several explicit expressions for the asymptotic covariance matrix have been developed under varying assumptions. The archetypal result is (9.17):

$$\mathrm{Cov}\, \hat\theta_N \approx \frac{\lambda_0}{N} \left[ \bar E\, \psi(t, \theta_0)\, \psi^T(t, \theta_0) \right]^{-1} \tag{9.90}$$

which is valid for a quadratic prediction-error criterion estimate under the assumption that $\mathcal{S} \in \mathcal{M}$. It tells us that the covariance matrix is given by the inverse of the covariance matrix of the predictor gradients $\psi$, normalized by the innovations variance $\lambda_0$ divided by the number of observed data $N$. This expression is also the limit of the Cramér-Rao lower bound, in case the innovations are normally distributed. This means that the estimate $\hat\theta_N$ then is asymptotically efficient.

The corresponding expression for the correlation approach is (9.78):

$$\mathrm{Cov}\, \hat\theta_N \approx \frac{\lambda_0}{N} \left[ \bar E\, \zeta(t, \theta_0)\, \psi^T(t, \theta_0) \right]^{-1} \left[ \bar E\, \zeta(t, \theta_0)\, \zeta^T(t, \theta_0) \right] \left[ \bar E\, \psi(t, \theta_0)\, \zeta^T(t, \theta_0) \right]^{-1} \tag{9.91}$$

We have also developed results for the asymptotic distribution of the resulting transfer-function estimates in black-box parametrizations. They tell us that

$$\mathrm{Cov} \begin{bmatrix} G(e^{i\omega}, \hat\theta_N) \\ H(e^{i\omega}, \hat\theta_N) \end{bmatrix} \approx \frac{n}{N}\, \Phi_v(\omega) \begin{bmatrix} \Phi_u(\omega) & \Phi_{ue}(-\omega) \\ \Phi_{ue}(\omega) & \lambda_0 \end{bmatrix}^{-1} \tag{9.92}$$

asymptotically both in model order $n$ and number of data $N$.

The distribution of the random vector $\hat\theta_N$ stresses that the result of the parameter identification phase is not a model: rather it is a set of models. Guided by the data, we have narrowed the original model set to, it is hoped, a much smaller set of possible descriptions around $\mathcal{M}(\hat\theta_N)$. Expressions (9.90) and (9.91) form a "quality tag" with which the model $\mathcal{M}(\hat\theta_N)$ is delivered to the user.


9.8 BIBLIOGRAPHY 


The asymptotic distribution of parameter estimates is, as we have seen, basically an application of a suitable central limit theorem (CLT). The probability literature offers an abundant supply of CLTs under varying conditions (see, e.g., Chung, 1974; Brown, 1971; Ibragimov and Linnik, 1975; Withers, 1981; and Dvoretzky, 1972). A key assumption in these results is that the dependence between samples far apart should decay at no less than a certain rate. Mixing conditions are used to describe this. In our treatment, the stability assumption (8.5) played this role.

The asymptotic normality of ML estimates in the case of independent observations is discussed in, for example, Kendall and Stuart (1961). This result was extended by Åström and Bohlin (1965) to ARMAX models.

In the statistical literature, asymptotic normality for parameter estimates in ARMA models of time series has been treated in a number of articles. Let us in particular mention the work of Hannan (1970, 1973, 1979), Dunsmuir and Hannan (1976), Hannan and Deistler (1988), and Anderson (1975).

The situation where $\mathcal{S} \notin \mathcal{M}$ was studied in Kabaila and Goodwin (1980) and in Ljung and Caines (1979). Kabaila (1983) has studied the OE-result (9.42).

Frequency-domain expressions for the covariance matrices, such as (9.54), have been used in, for example, Hannan (1970) and Kabaila and Goodwin (1980). The black-box expression (9.62) has been derived in Ljung (1985b). Results where the order $n$ is a function of the number of data $N$ and increases to infinity are given for FIR models in Ljung and Yuan (1985) and for the ARX model (4.7) in Ljung and Wahlberg (1992). Related results for AR modeling of spectra were obtained by Berk (1974).

Extensions of the results (9.62) to the multivariable case are given in Zhu (1989), and to recursively identified models in Gunnarsson and Ljung (1989).

The asymptotic normality for IV estimates was shown by Caines (1976b), and a comprehensive treatment is contained in Söderström and Stoica (1983). For PLR methods, results are given by Stoica et al. (1984).

Discussions of the use of confidence intervals are standard in many textbooks (see, e.g., Draper and Smith, 1981).


9.9 PROBLEMS 


9G.1 Consider the asymptotic expression (9.62) and let $\Phi_{ue}(\omega) \equiv 0$. Note that the real and imaginary parts are uncorrelated. (This follows as in (6.62) since the estimates are uncorrelated at different frequencies.) Let

$$\hat A_N(\omega) = |\hat G_N(e^{i\omega})|, \qquad \hat\varphi_N(\omega) = \arg \hat G_N(e^{i\omega}) \;\text{(radians)}$$

Show that

$$\mathrm{Cov}\, \hat A_N(\omega) \sim \frac{1}{2}\, \frac{n}{N}\, \frac{\Phi_v(\omega)}{\Phi_u(\omega)}$$

$$\mathrm{Cov}\, \hat\varphi_N(\omega) \sim \frac{1}{2}\, \frac{n}{N}\, \frac{\Phi_v(\omega)}{\Phi_u(\omega)\, |G_0(e^{i\omega})|^2}$$

Show also that

$$E|\hat G_N(e^{i\omega})|^2 - |E \hat G_N(e^{i\omega})|^2 \sim \frac{n}{N}\, \frac{\Phi_v(\omega)}{\Phi_u(\omega)}$$

9G.2 From a model $H(e^{i\omega}, \hat\theta_N)$ and an estimate $\hat\lambda_N$ of the innovation variance, we can form an estimate of the noise spectrum

$$\hat\Phi_v^N(\omega) = \hat\lambda_N\, |H(e^{i\omega}, \hat\theta_N)|^2$$

Use the asymptotic expression (9.62) to show that

$$\mathrm{Var}\, \hat\Phi_v^N(\omega) \sim \frac{2n}{N}\, \Phi_v^2(\omega), \qquad \omega \neq 0, \pm\pi$$

in case $\Phi_{ue}(\omega) \equiv 0$. (Note that the variance of $\hat\lambda_N$ does not increase with the model order $n$; cf. Problem 9E.1.) Compare with (6.75).

9G.3 Suppose that $\mathcal{S} \notin \mathcal{M}$ but that $\mathcal{S}$ "is close to" $\mathcal{M}$ in the sense

$$\varepsilon(t, \theta^*) = e_0(t) + \rho(t)$$

where $\{e_0(t)\}$ is white noise with variance $\lambda_0$ and $E\rho^2(t) = \sigma^2$. Show that the matrix in Theorem 9.1 then can be written

$$P_\theta = \lambda_0 \left[ \bar E\, \psi(t, \theta^*)\, \psi^T(t, \theta^*) \right]^{-1} + R(\sigma)$$

where

$$\|R(\sigma)\| \leq C \cdot \sigma$$

What quantities does the constant $C$ depend on?

9G.4 Consider the correlation estimate $\hat\theta_N$ defined by (7.110) for a general, differentiable function $\alpha(\cdot)$. Use the expressions (9.73) and (9.74) to show that, in case $\mathcal{S} \in \mathcal{M}$, the asymptotic covariance is given by (9.78), with $\lambda_0$ replaced by

$$\frac{E[\alpha(e_0(t))]^2}{[E\alpha'(e_0(t))]^2}$$

9G.5 Show that the estimate $\hat\lambda_N$ in (9.19) is asymptotically uncorrelated with $\hat\theta_N$ in case $\mathcal{S} \in \mathcal{M}$.

Hint: Use the parameterization of (7.87). Note that the minimizing $\lambda$ is (9.19). Note also that (9.15) is applicable to arbitrary criteria (9.12).

9G.6 Consider the result (9.42). Show that

$$P_\rho \geq \lambda_0 \left[ \bar E\, \psi_\rho^0(t)\, [\psi_\rho^0(t)]^T \right]^{-1}$$

where $\psi_\rho^0$ denotes the gradient $\psi_\rho$ obtained with the true noise model $H_0(q)$, and that equality is obtained for $F(q) = 1$, i.e., $H(q, \eta^*) = H_0(q)$.

Hint: Use $\sum_t \psi_\rho(t) \psi_\rho^T(t) = \Psi^T \Psi$ for a suitably defined matrix $\Psi$, and similarly for $\tilde\psi_\rho$. Note that $\tilde\Psi = F\Psi$ for a suitably defined matrix $F$. Then apply Lemma II.2.

9G.7 The result (9.61) can be extended as follows. Let


9E.1 


9E.2 


1 a ; , 
R,(A) = >> f Wale JAWI (e7® dé 
with W, defined by (9.60), and similarly for R„ (B). Then 
1 P , 
lim = Wa (€"°) Ra (A) Rn(B) Wale) = A(w)B(w) (9.93) 


(see Ljung and Yuan, 1985). Use this result together with (9.55) to prove that (9.63a) 
still holds if an output error model is applied, despite the fact that a noise model is 
not estimated. (This shows that asymptotically, as the model order tends to infinity, 
estimating a correct noise model gives no gain in the accuracy of G. This may be 
counterintuitive. and is not true for finite n. The reason is. however. that the asymptotic 
results refer to the case where the estimates at different frequencies are decoupled. Then 
there is nothing to gain by using information at different frequencies, weighted together 
according to the true noise model. to form the estimate of G.) 


Consider the case with a @-parametrized criterion function 
£(€, 0) 


Assume that S € M, and derive an expression for the asymptotic covariance matrix 
using (9.9) to (9.14). Apply this expression to 


= N A = 

8 1 1 é*(t. 6 1 

a = arg min = 5 + 5 98A 
An 2 = 


Assume that €(f. 09) = e(t). where eo(t) is white noise with Ee (t) = Ao. Eeg(t) = 0. 
and Fe j(t) = fo. Show that 
; 1 È 
Cov hn ~ = (po — 23) (9.94) 


[We know from Problem 7€.7 that the minimization gives 
le 
;: = — e° i 6 7 
N= 2 (7. On) 


so (9.94) gives the variance of the estimate in (9.19). Compare also (II.73).] 
Consider a signal given by 
y(t) — aoy(t — 1) = ealt) 


where {e9(t)} is white noise with variance A). The k-step-ahead predictor model struc- 
ture 
tilt — koa) = a y(t — k) 


is used (9 = a). Determine the asymptotic variance of ây. Which k minimizes this 
variance? (Try k = 1, k = 2 if the general case seems too difficult.) 
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9E3 


9E.4 


9E.5 


9D.1 


9D.2 


Consider Example 9.1, and assume that the following ARMAX model structure is used: 
M: y(t) tavt — 1) = ult — 1) + et) + celt — 1) 
6=[a cf 
Let @ be estimated by a quadratic prediction-error method. Show that 


= Ao 1 - a; 

Covay ~ ———> 

N u + hous 

What is the variance if instead the PLR method (7.113) is used? 


Apply the multivariable expression (9.47) to the quadratic criterion (7.27) L(e) = 
te’ A'e, assuming Ao = Eeo(t)eg(1). Let P(A) denote the resulting covariance 
matrix, and show that 


Pah) = [EWA y [EWA AAW EVA T YT)! 
Use Problem 7D.8 to show that 


Pa(A) > Pa(Ao). for all symmetric positive definite A 


Let 


N 
1 
Vy(0. Z") = det = X elt. O)e7r, 8) 


f=1 


and define Êy = arg min Vy (0. Z N), Apply the formulas (9.9) and (9.11) to determine 
the asymptotic covariance of Ôx, assuming S € M and Eesel) = Ao. Compare 
with Problem 9E.4. 


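Suppose that we are using time-varying norms $\ell(\varepsilon, t)$ in the criterion function [as in
(7.18); assume no $\theta$-dependence, though]. Show that if $S \in \mathcal{M}$ and $E\ell'_\varepsilon(e_0(t), t) = 0$
for all t, then (9.29) and (9.30) still hold with
$$\kappa(\ell) = \frac{\lim_{N\to\infty}\frac{1}{N}\sum_{t=1}^{N}E\left[\ell'_\varepsilon(e_0(t),t)\right]^2}{\left[\lim_{N\to\infty}\frac{1}{N}\sum_{t=1}^{N}E\,\ell''_{\varepsilon\varepsilon}(e_0(t),t)\right]^2} = \frac{\lim_{N\to\infty}\frac{1}{N}\sum_{t=1}^{N}\int\left[\ell'_\varepsilon(x,t)\right]^2 f_e(x,t)\,dx}{\left[\lim_{N\to\infty}\frac{1}{N}\sum_{t=1}^{N}\int\ell''_{\varepsilon\varepsilon}(x,t)f_e(x,t)\,dx\right]^2}$$
where $f_e(x,t)$ is the possibly time-varying PDF of $e_0(t)$.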
Let $\{e(t)\}$ be a sequence of independent random variables with zero means, and let $z(t)$
depend on $e(s)$, $s \le t - 1$. Thus $z(t)$ is independent of $e(s)$, $s \ge t$. Show that
$$Ez(t)z(s)e(t)e(s) = Ez(t)z(s)\cdot Ee(t)e(s)$$


Suppose that, in Theorem 9.1, the condition $\sqrt{N}D_N \to 0$ does not hold. Define a
nonrandom parameter value $\theta^*_N$ suitably, and show that the result of Theorem 9.1 then
holds when applied to $\sqrt{N}(\hat\theta_N - \theta^*_N)$.

APPENDIX 9A: PROOF OF THEOREM 9.1


In this appendix we shall employ a technique for proving asymptotic normality that is
quite useful when dealing with signals related to identification experiments. The idea
is to split the sum (9.8) into one part that satisfies a certain independence condition
(M-dependence) among its terms and one part that is small. When studying these
parts, the following two lemmas are instrumental.


Lemma 9A.1 (Orey, 1958; Rosén, 1967). Consider the sum of doubly indexed
random variables $x_N(k)$:
$$Z_N = \sum_{k=1}^{N} x_N(k) \tag{9A.1}$$
where
$$Ex_N(k) = 0 \tag{9A.2}$$
Suppose that
$$\{x_N(1), \ldots, x_N(s)\} \quad\text{and}\quad \{x_N(t), x_N(t+1), \ldots, x_N(N)\} \tag{9A.3}$$
are independent if $t - s > M$, where M is an integer, that
$$\limsup_{N\to\infty}\sum_{k=1}^{N}E|x_N(k)|^2 < \infty \tag{9A.4}$$
and that
$$\lim_{N\to\infty}\sum_{k=1}^{N}E|x_N(k)|^{2+\delta} = 0, \quad\text{some } \delta > 0 \tag{9A.5}$$
Let
$$Q = \lim_{N\to\infty}EZ_NZ_N^T \tag{9A.6}$$
Then
$$Z_N \in AsN(0, Q) \tag{9A.7}$$


Proof. See Orey (1958) or Rosén (1967). Here we note that (9A.5) (Lyapunov's
condition) implies Lindeberg's condition, which is used in the quoted references.

A sequence $\{x_N(k)\}$ subject to (9A.3) is said to be M-dependent.


Lemma 9A.2 (Diananda, 1953; Anderson, 1959). Let
$$S_N = Z_M(N) + X_M(N), \qquad M, N = 1, 2, \ldots$$
such that
$$EX_M^2(N) \le C_M, \qquad \lim_{M\to\infty}C_M = 0 \tag{9A.8}$$
$$P\{Z_M(N) \le z\} = F_{M,N}(z) \tag{9A.9}$$
$$\lim_{N\to\infty}F_{M,N}(z) = F_M(z) \tag{9A.10}$$
$$\lim_{M\to\infty}F_M(z) = F(z) \tag{9A.11}$$
Then
$$\lim_{N\to\infty}P\{S_N \le z\} = F(z) \tag{9A.12}$$
Proof. See Diananda (1953) or Anderson (1959).


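We now turn to the proof of Theorem 9.1. Write for short $\psi(t) = \psi(t, \theta^*)$,
$\varepsilon(t) = \varepsilon(t, \theta^*)$ and let
$$S_N = \frac{1}{\sqrt{N}}\sum_{t=1}^{N}\left(\psi(t)\varepsilon(t) - E\psi(t)\varepsilon(t)\right) \tag{9A.13}$$
Then, according to (9.6) and (9.7),
$$-V'_N(\theta^*, Z^N) = \frac{1}{\sqrt{N}}\cdot S_N + D_N \tag{9A.14}$$
From (8.19),
$$\varepsilon(t) = \sum_{k=1}^{\infty}d_t^{(1)}(k)r(t-k) + \sum_{k=0}^{\infty}d_t^{(2)}(k)e_0(t-k)$$
Similarly,
$$\psi(t) = \sum_{k=1}^{\infty}d_t^{(3)}(k)r(t-k) + \sum_{k=0}^{\infty}d_t^{(4)}(k)e_0(t-k)$$
Here
$$|d_t^{(i)}(k)| \le \beta_k, \ \text{all } t, i, \quad\text{and}\quad \sum_{k=1}^{\infty}\beta_k < \infty \tag{9A.15}$$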

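according to the assumptions of uniform stability. Now let
$$\varepsilon^M(t) = \sum_{k=1}^{\infty}d_t^{(1)}(k)r(t-k) + \sum_{k=0}^{M}d_t^{(2)}(k)e_0(t-k) \tag{9A.16}$$
$$\tilde\varepsilon^M(t) = \varepsilon(t) - \varepsilon^M(t) = \sum_{k=M+1}^{\infty}d_t^{(2)}(k)e_0(t-k) \tag{9A.17}$$
Define $\psi^M(t)$ and $\tilde\psi^M(t)$ analogously. Now
$$S_N = Z_M(N) + X_M(N) \tag{9A.18}$$
where
$$Z_M(N) = \frac{1}{\sqrt{N}}\sum_{t=1}^{N}\left(\psi^M(t)\varepsilon^M(t) - E\psi^M(t)\varepsilon^M(t)\right) \tag{9A.19}$$
$$X_M(N) = \frac{1}{\sqrt{N}}\sum_{t=1}^{N}\left\{\tilde\psi^M(t)\varepsilon(t) + \psi^M(t)\tilde\varepsilon^M(t) - E\left[\tilde\psi^M(t)\varepsilon(t) + \psi^M(t)\tilde\varepsilon^M(t)\right]\right\} \tag{9A.20}$$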


We shall apply Lemma 9A.2 to (9A.18). First we show that $Z_M(N)$ is asymptotically
normal according to Lemma 9A.1. The terms of $Z_M(N)$ are clearly zero mean
and M-dependent by construction. Thus conditions (9A.1) to (9A.3) of Lemma 9A.1
are satisfied. For (9A.4) and (9A.5), we find that


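$$\sum_{t=1}^{N}E\left|\frac{1}{\sqrt{N}}\left(\psi^M(t)\varepsilon^M(t) - E\psi^M(t)\varepsilon^M(t)\right)\right|^{2+\delta} \le C\,N^{-\delta/2}\cdot\frac{1}{N}\sum_{t=1}^{N}\left(E|\psi^M(t)|^{4+2\delta}\cdot E|\varepsilon^M(t)|^{4+2\delta}\right)^{1/2} \le \frac{1}{N^{\delta/2}}\cdot C \tag{9A.21}$$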


where the last inequality follows from the fact that $\psi^M$ and $\varepsilon^M$ are finite sums of
random variables with bounded $(4 + \delta)$ moments. The expression (9A.21) proves
both (9A.4) and (9A.5). With
$$Q_M = \lim_{N\to\infty}EZ_M(N)Z_M^T(N) \tag{9A.22}$$
it thus follows from Lemma 9A.1 that
$$Z_M(N) \in AsN(0, Q_M) \tag{9A.23}$$


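Consider now the term $X_M$ in (9A.20). Let
$$\sqrt{N}X_M^{(1)}(N) = \sum_{t=1}^{N}\left[\psi^M(t)\tilde\varepsilon^M(t) - E\psi^M(t)\tilde\varepsilon^M(t)\right]$$
From the corollary to Lemma 2B.1, we have
$$E\left|\sqrt{N}X_M^{(1)}(N)\right|^2 \le C\cdot C_{\varepsilon,M}^2\cdot C_\psi^2\cdot N$$
where
$$C_{\varepsilon,M} = \sum_{k=M+1}^{\infty}\sup_t|d_t^{(2)}(k)| \le \sum_{k=M+1}^{\infty}\beta_k$$
and
$$C_\psi = \sum_{k=1}^{\infty}\sup_t|d_t^{(4)}(k)| \le \sum_{k=1}^{\infty}\beta_k \le C$$
A similar bound applies to the other terms in $X_M(N)$. Hence
$$E|X_M(N)|^2 \le C\left[\sum_{k=M+1}^{\infty}\beta_k\right]^2$$
which tends to zero as M tends to infinity, since the sum over $\beta_k$ is convergent.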
Lemma 9A.2 now tells us that the asymptotic distribution of $S_N$ is given by (9A.23)
in the limit $M \to \infty$. Hence
$$S_N \in AsN(0, Q), \qquad Q = \lim_{M\to\infty}Q_M \tag{9A.24}$$
Since $\sqrt{N}D_N \to 0$, the asymptotic distribution of $\sqrt{N}V'_N(\theta^*, Z^N)$ coincides with
that of $S_N$; that is,
$$\sqrt{N}V'_N(\theta^*, Z^N) \in AsN(0, Q) \tag{9A.25}$$
Also, $\sqrt{N}D_N \to 0$ together with
$$\lim_{M\to\infty}\lim_{N\to\infty}E|X_M(N)|^2 = 0$$
implies that
$$Q = \lim_{M\to\infty}\lim_{N\to\infty}EZ_M(N)Z_M^T(N) = \lim_{N\to\infty}N\cdot EV'_N(\theta^*, Z^N)\left[V'_N(\theta^*, Z^N)\right]^T \tag{9A.26}$$

We have now completed the proof of (9.8) and (9.9). It remains only to verify (9.4).
We have
$$V''_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N}\left[\psi(t,\theta)\psi^T(t,\theta) - \psi'(t,\theta)\varepsilon(t,\theta)\right] \tag{9A.27}$$
Since the model set and the system are uniformly stable, we find, as in (8.19) and
(8.20), that $\varepsilon$, $\psi$, and $\psi'$ are obtained by uniformly stable filtering of white noise.
Hence, as in Lemma 8.2, Theorem 2B.1 implies that
$$\sup_{\theta\in D_{\mathcal{M}}}\left|V''_N(\theta, Z^N) - \bar V''(\theta)\right| \to 0, \quad\text{w.p. 1 as } N\to\infty \tag{9A.28}$$
Consequently, since $\hat\theta_N \to \theta^*$,
$$V''_N(\xi_N, Z^N) \to \bar V''(\theta^*), \quad\text{w.p. 1 as } N\to\infty \tag{9A.29}$$
if $\xi_N$ belongs to a neighborhood of $\theta^*$ with radius $|\hat\theta_N - \theta^*|$. [Technical note: The
application of the Taylor expansion in (9.3) actually may give “different $\xi_N$” in different
rows of this vector expression. The result (9A.29) is, however, not affected by
this.]
Now, since the Taylor expansion (9.3) gives $0 = V'_N(\theta^*, Z^N) + V''_N(\xi_N, Z^N)(\hat\theta_N - \theta^*)$,
$$\sqrt{N}(\hat\theta_N - \theta^*) = -\left[V''_N(\xi_N, Z^N)\right]^{-1}\sqrt{N}V'_N(\theta^*, Z^N)$$
and (9A.29) together with (9A.25) complete the proof of Theorem 9.1.


Remark. If $\hat\theta_N$ is on the boundary of $D_{\mathcal{M}}$, (9.3) does not apply. However, it
follows from (9B.11) that this event has a probability that decays sufficiently fast not
to affect the asymptotic distribution.


APPENDIX 9B: THE ASYMPTOTIC PARAMETER VARIANCE 


The asymptotic distribution result of Theorem 9.1 does not necessarily imply that
$$\operatorname{Cov}(\sqrt{N}\hat\theta_N) = N\cdot E(\hat\theta_N - E\hat\theta_N)(\hat\theta_N - E\hat\theta_N)^T \to P_\theta \quad\text{as } N\to\infty \tag{9B.1}$$
with $P_\theta$ given by (9.11) and (9.9), or that the left side of (9B.1) even exists. In this
appendix we shall deal with this question.
We introduce the following notation:

$V_N(\theta, Z^N)$, $\hat\theta_N$, and $\xi_N$ as in (9.1) to (9.3)

$$\bar V_N(\theta) = EV_N(\theta, Z^N)$$
$$\theta^*_N = \arg\min_{\theta\in D_{\mathcal{M}}}\bar V_N(\theta)$$
$$\bar\theta_N = E\hat\theta_N$$
We then have, as in (9.3),
$$\hat\theta_N - \theta^*_N = -\left[V''_N(\xi_N, Z^N)\right]^{-1}V'_N(\theta^*_N, Z^N) \tag{9B.2}$$
The assumption of Theorem 9.1 that $\bar V''(\theta^*) > 0$ together with the continuity of
$\bar V''(\theta)$ implies that there exists a $\delta > 0$ such that
$$\bar V''(\theta) > \delta I, \quad \forall\theta, \ |\theta - \theta^*| < \delta \tag{9B.3}$$

Similarly, the assumption that $\theta^*$ is the unique minimizing point implies that there
exist $\delta > 0$, $\delta^* > 0$ such that


6 = — 
1a, —@"| < 5 => V(™,) < V(8) — & Ve. la -—O6"| > 8 (9B4) 
(The deltas in (9B.3) and (9B.4) are as such unrelated; for convenience they are taken
to be the same, without loss of generality.) Introduce the following subsets of the
event space:


N 
Q 1 


|. 5 
ray —6*| < -| (9B.5) 


; A ; ô ô 
ay = fool view. 2") 2z ae Lally, ly — @*| < | (9B.6) 


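Let $\bar\Omega_i^N$ be the complement events. Clearly, in view of (9B.4),
$$\bar\Omega_1^N \subset \left\{\omega \;\Big|\; \sup_\theta\left|V_N(\theta, Z^N) - \bar V(\theta)\right| > \frac{\delta^*}{2}\right\} \tag{9B.7}$$
$$\bar\Omega_2^N \subset \left\{\omega \;\Big|\; \sup_\theta\left|V''_N(\theta, Z^N) - \bar V''(\theta)\right| > \frac{\delta}{2}\right\} \tag{9B.8}$$
($\omega$ here is the elementary event variable, of which the random variables $Z^N$ and $\hat\theta_N$
are functions.) Let $P(\Omega)$ denote the probability of the event $\Omega$. Then $P(\Omega_1^N) \to 1$
as $N \to \infty$ since $\hat\theta_N \to \theta^*$, w.p. 1, and $P(\Omega_2^N) \to 1$ as $N \to \infty$ since (9B.3) holds
and $V''_N(\theta, Z^N)$ converges uniformly to $\bar V''(\theta)$, w.p. 1 [see (9A.28)]. Let us compute
bounds for the probabilities. First note the following strengthening of Lemma 2B.2: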
Under the assumption of Theorem 2B.1, strengthened so that the eighth moments
of $\{e(t)\}$ are bounded, we have
$$E(R_r^N)^4 \le C(N - r)^2 \tag{9B.9}$$
See Ljung (1984), Corollary 2, with $\gamma(n) = 1/n$, for a proof. Applying Chebyshev's
inequality to the fourth moments then gives
$$P(\bar\Omega_1^N) \le \frac{16\,C}{(\delta^*)^4N^2} \tag{9B.10a}$$
and
$$P(\bar\Omega_2^N) \le \frac{16\,C}{\delta^4N^2} \tag{9B.10b}$$
Let
$$\Omega^N = \Omega_1^N \cap \Omega_2^N$$
Then, for the complement event, we have from (9B.10)
$$P(\bar\Omega^N) \le \frac{C}{N^2} \tag{9B.11}$$


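Now consider for (9B.2)
$$E\left|\sqrt{N}(\hat\theta_N - \theta^*_N)\right|^4 \le E\left\{\left|\left[V''_N(\xi_N, Z^N)\right]^{-1}\right|^4\cdot\left|\sqrt{N}V'_N(\theta^*_N, Z^N)\right|^4\right\}$$
The right side is an integration over $\omega$. Splitting the integration into the subsets $\Omega^N$
and $\bar\Omega^N$, we find that on $\Omega^N$
$$V''_N(\xi_N, Z^N) \ge \frac{\delta}{2}I$$
(this is indeed the rationale behind defining the set $\Omega^N$). Hence, with symbolic
notation,
$$E\left|\sqrt{N}(\hat\theta_N - \theta^*_N)\right|^4 \le \int_{\Omega^N}\left(\frac{2}{\delta}\right)^4\left|\sqrt{N}V'_N(\theta^*_N, Z^N)\right|^4d\omega + \int_{\bar\Omega^N}N^2\left|\hat\theta_N - \theta^*_N\right|^4d\omega$$
$$\le C\cdot E\left|\sqrt{N}V'_N(\theta^*_N, Z^N)\right|^4 + N^2\cdot C\cdot P(\bar\Omega^N) \le C$$
The second inequality follows since $\hat\theta_N$ and $\theta^*_N$ belong to a bounded set $D_{\mathcal{M}}$. The last
inequality follows from (9B.11) and from Lemma 2B.2 applied in its strengthened
version (9B.9) to the sum of zero mean variables $N\cdot V'_N(\theta^*_N, Z^N)$.
Theorem 4.5.2 of Chung (1974) now implies that
$$N\cdot E(\hat\theta_N - \theta^*_N)(\hat\theta_N - \theta^*_N)^T \to P_\theta, \quad\text{as } N\to\infty \tag{9B.12}$$
with $P_\theta$ being the variance of the asymptotic distribution.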
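Finally, rewrite (9B.2) as
$$V'_N(\theta^*_N, Z^N) = -\bar V''(\theta^*_N)(\hat\theta_N - \theta^*_N) - \left[V''_N(\xi_N, Z^N) - \bar V''(\theta^*_N)\right](\hat\theta_N - \theta^*_N)$$
Taking expectation gives
$$0 = \bar V''(\theta^*_N)(\bar\theta_N - \theta^*_N) + E\left\{\left[V''_N(\xi_N, Z^N) - \bar V''(\theta^*_N)\right](\hat\theta_N - \theta^*_N)\right\}$$
or, using Schwarz's inequality,
$$|\bar\theta_N - \theta^*_N| \le \left|\left[\bar V''(\theta^*_N)\right]^{-1}\right|\cdot\left[E\left|V''_N(\xi_N, Z^N) - \bar V''(\xi_N)\right|^2 + E\left|\bar V''(\xi_N) - \bar V''(\theta^*_N)\right|^2\right]^{1/2}\cdot\left[E|\hat\theta_N - \theta^*_N|^2\right]^{1/2}$$
For the middle factor, we apply Lemma 2B.2 to the first term, showing it decays like
C/N, and we use (assuming $\bar V''$ to be differentiable)
$$\left|\bar V''(\xi_N) - \bar V''(\theta^*_N)\right| \le C|\xi_N - \theta^*_N| \le C|\hat\theta_N - \theta^*_N|$$
for the second term, showing that it decays like $E|\hat\theta_N - \theta^*_N|^2$ [i.e., C/N according
to (9B.12)]. Collecting this gives
$$|E\hat\theta_N - \theta^*_N| \le \frac{C}{N} \tag{9B.13}$$
Clearly, (9B.12) and (9B.13) imply (9B.1). We can thus sum up the discussion in this
appendix as follows: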


Consider the estimate $\hat\theta_N$ under the conditions of Theorem 9.1, strengthened
so that the eighth moments of $\{e_0(t)\}$ in (8.4) are bounded, and that
$\bar V(\theta)$ is three times continuously differentiable. Dispense with the assumption
that $\sqrt{N}D_N \to 0$ as $N \to \infty$. Then (9B.1) and (9B.13) hold.

Remark. The assumption in Theorem 9.1 that $\sqrt{N}D_N$ tends to zero implies that
$\theta^*_N \to \theta^*$ sufficiently fast so as to allow $\theta^*$ to be used in the asymptotic expressions
instead of $\theta^*_N$.

COMPUTING THE ESTIMATE


In Chapter 7 we gave three basic procedures for parameter estimation:

1. The prediction-error approach, in which a certain function $V_N(\theta, Z^N)$ is minimized
with respect to $\theta$.

2. The correlation approach, in which a certain equation $f_N(\theta, Z^N) = 0$ is solved
for $\theta$.

3. The subspace approach to estimating state space models.

In this chapter we shall discuss how these problems are best solved numerically.

At time N, when the data set $Z^N$ is known, the functions $V_N$ and $f_N$ are
just ordinary functions of a finite-dimensional real parameter vector $\theta$. Solving the
problems therefore amounts to standard questions of nonlinear programming and
numerical analysis. Nevertheless, it is worthwhile to consider the problems in our
parameter estimation setting, since this adds a certain amount of structure to the
functions involved.

The subspace methods (7.66) can be implemented in several different ways
and contain many options. In Section 10.6 we give the details of this, together with
an independent derivation of the techniques.


10.1 LINEAR REGRESSIONS AND LEAST SQUARES 


The Normal Equations 


For linear regressions, the prediction is given as
$$\hat y(t|\theta) = \varphi^T(t)\theta \tag{10.1}$$
The prediction-error approach with a quadratic norm applied to (10.1) gives the
least-squares method described in Section 7.3. The minimizing element $\hat\theta_N^{LS}$ can then be
written as in (7.34):
$$\hat\theta_N^{LS} = R^{-1}(N)f(N) \tag{10.2}$$
with
$$R(N) = \frac{1}{N}\sum_{t=1}^{N}\varphi(t)\varphi^T(t) \tag{10.3}$$
$$f(N) = \frac{1}{N}\sum_{t=1}^{N}\varphi(t)y(t) \tag{10.4}$$
An alternative is to view $\hat\theta_N^{LS}$ as the solution of
$$R(N)\hat\theta_N^{LS} = f(N) \tag{10.5}$$


These equations are known as the normal equations. Note that the basic equation
(7.118) for the IV method is quite analogous to (10.5), and most of what is said in this
section about the LS method also applies to the IV method with obvious adaptation.

The coefficient matrix R(N) in (10.5) may be ill conditioned, in particular if its
dimension is high. There exist methods to find $\hat\theta_N^{LS}$ that are much better numerically
behaved, which do not have the normal equations as a starting point. This has
been discussed in an extensive literature on numerical studies of linear least-squares
problems. Lawson and Hanson (1974) can be mentioned as a basic reference for
the problems under consideration. The underlying idea in these methods is that the
matrix R(N) should not be formed, since it contains products of the original data.
Instead, a matrix R is constructed with the property
$$RR^T = R(N)$$
Therefore, this class of methods is commonly known as “square-root algorithms” in
the engineering literature. The term is not quite adequate, since no square roots are
ever taken. It would be more appropriate to use the term “quadratic methods” when
solving (10.5).


Solving for the LS Estimate by QR Factorization 


There are some different approaches to the construction of R, such as Householder
transformations, Householder (1964), the Gram-Schmidt procedure, Björck (1967),
and Cholesky decomposition. We shall here describe an efficient way using
QR-factorizations. See, e.g., Golub and van Loan (1996) for a thorough description of
the method. The QR-factorization of an n × d matrix A is defined as
$$A = QR, \qquad QQ^T = I, \qquad R \ \text{upper triangular} \tag{10.6}$$
Here Q is n × n and R is n × d.

To apply this to the LS parameter estimation case, we rewrite the general
multivariable case (4.59) in matrix terms, by introducing
$$Y^T = \left[y^T(1) \ \ldots \ y^T(N)\right], \quad Y \text{ is } Np\times 1$$
$$\Phi^T = \left[\varphi(1) \ \ldots \ \varphi(N)\right], \quad \Phi \text{ is } Np\times d \tag{10.7}$$
(Here p = dim y.) Then the LS criterion can be written (cf. (II.13))
$$V_N(\theta, Z^N) = |Y - \Phi\theta|^2 = \sum_{t=1}^{N}\left|y(t) - \varphi^T(t)\theta\right|^2 \tag{10.8}$$
The norm is obviously not affected by any orthonormal transformation applied to
the vector $Y - \Phi\theta$. Therefore if Q (pN × pN) is orthonormal, that is $QQ^T = I$,
then
$$V_N(\theta, Z^N) = |Q(Y - \Phi\theta)|^2$$


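Now, introduce the QR-factorization
$$[\Phi \ \ Y] = QR, \qquad R = \begin{bmatrix}R_0\\ 0\end{bmatrix} \tag{10.9}$$
Here $R_0$ is an upper triangular (d + 1) × (d + 1) matrix, which we decompose as
$$R_0 = \begin{bmatrix}R_1 & R_2\\ 0 & R_3\end{bmatrix}, \quad R_1 \text{ is } d\times d, \ R_2 \text{ is } d\times 1, \ R_3 \text{ is scalar} \tag{10.10}$$
This means that
$$V_N(\theta, Z^N) = |Q^T(Y - \Phi\theta)|^2 = \left|\begin{bmatrix}R_2\\ R_3\end{bmatrix} - \begin{bmatrix}R_1\theta\\ 0\end{bmatrix}\right|^2 = |R_2 - R_1\theta|^2 + |R_3|^2$$
which clearly is minimized for
$$R_1\hat\theta_N = R_2, \quad\text{giving}\quad V_N(\hat\theta_N, Z^N) = |R_3|^2 \tag{10.11}$$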


There are three important advantages with this way of solving for the LS estimate:

1. With R(N) as in (10.3), $R(N) = \Phi^T\Phi = R_1^TR_1$, so the condition number
(the ratio between the largest and smallest singular value) of $R_1$ is the square
root of that of R(N). Therefore (10.11) is much better conditioned than its
counterpart (10.5).

2. $R_1$ is a triangular matrix, so the equation is easy to solve.

3. If the QR-factorization is performed for a regressor size $d^*$, then the solutions
and loss functions for all models with fewer parameters (obtained by setting
trailing parameters in $\theta$ to zero) are easily obtained from $R_0$.

Note that the big matrix Q is never required to find $\hat\theta_N$ and the loss function. All
information is contained in the “small” matrix $R_0$. In MATLAB considerable
computational saving is obtained by taking R = triu(qr(A)) to compute (10.6) if Q is of
no interest and A has many more rows than columns.
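
As an illustration of (10.9) to (10.11), the following is a minimal sketch in Python with NumPy (an assumption; the text itself only mentions MATLAB). The function name `ls_via_qr` and the test data are hypothetical. Only the triangular factor of $[\Phi \ Y]$ is computed, and the estimate and the loss are read off from its blocks.

```python
import numpy as np


def ls_via_qr(Phi, Y):
    """Least-squares estimate and loss from the triangular factor of [Phi Y]."""
    d = Phi.shape[1]
    A = np.column_stack([Phi, Y])
    # mode="r" returns only the upper triangular factor R0; Q is never formed.
    R0 = np.linalg.qr(A, mode="r")
    R1, R2, R3 = R0[:d, :d], R0[:d, d], R0[d, d]
    # R1 is triangular, so back-substitution would suffice; a general
    # solver is used here only to keep the sketch short.
    theta_hat = np.linalg.solve(R1, R2)
    return theta_hat, R3 ** 2, R1


# Hypothetical test data: 200 samples, 3 regressors.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((200, 3))
Y = Phi @ np.array([1.0, -0.5, 2.0]) + 0.1 * rng.standard_normal(200)

theta_qr, loss_qr, R1 = ls_via_qr(Phi, Y)
# Reference solution from a general least-squares routine.
theta_ref, res_ref, _, _ = np.linalg.lstsq(Phi, Y, rcond=None)
```

In this sketch the loss is the unnormalized sum of squared residuals, as in (10.8). The condition number of `R1` can be compared with that of `Phi.T @ Phi` to observe the square-root relationship stated in the first advantage.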


Initial Conditions: “Windowed” Data


A typical structure of the regression vector $\varphi(t)$ is that it consists of shifted data
(possibly after some trivial reordering):
$$\varphi(t) = \begin{bmatrix}z(t-1)\\ \vdots\\ z(t-n)\end{bmatrix} \tag{10.12}$$
Here $z(t-1)$ is an r-dimensional vector. For example, the ARX model (4.11) with
$n_a = n_b = n$ gives (10.12) with
$$z(t) = \begin{bmatrix}-y(t)\\ u(t)\end{bmatrix}$$
while an AR model for a p-dimensional process $\{y(t)\}$ [cf. (4.57), $n_b = 0$] obeys
(10.12) with $z(t) = -y(t)$.

With the structure (10.12), the matrix R(N) in (10.3) will be an n × n block
matrix, whose ij block is the r × r matrix
$$R_{ij}(N) = \frac{1}{N}\sum_{t=1}^{N}z(t-i)z^T(t-j) \tag{10.13}$$
If we have knowledge only of z(t) for $1 \le t \le N$, the question arises of how to deal
with the unknown initial conditions for $t \le 0$ in (10.13). Two approaches can be
taken:


1. Start the summation in (10.3) and (10.4) at t = n + 1 rather than at t = 1. Then
all sums (10.13) will involve only known data. [After a suitable redefinition of
N and the time origin, we can of course stick to our usual expressions, assuming
z(t) to be known for t > −n.]

2. Replace the unknown initial values by zeros (“prewindowing”). For symmetry,
the trailing values z(t), t = N + 1, ..., N + n, could also be replaced by zeros
(“postwindowing”) and the summation in (10.3) is extended to N + n. In this
case (10.5) are also known as the Yule-Walker equations. Often additional
data windows (“tapering”; cf. Problem 6G.5) are applied to both ends of the
data record to soften the effects of the appended zeros.

In speech processing these approaches are known as the covariance method and
autocorrelation method, respectively (Makhoul and Wolf, 1972). No doubt, from
a logical point of view, approach 1 seems the most natural. Approach 2, however,
gives the special feature that the blocks (10.13) will depend only on the difference
between the indexes:
$$R_{ij}(N) = R_\tau(N), \qquad \tau = i - j$$
$$R_\tau(N) = \frac{1}{N}\sum_{t=\tau}^{N}z(t-\tau)z^T(t), \quad \tau \ge 0, \ \text{analogously for } \tau < 0 \tag{10.14}$$
This makes R(N) a block Toeplitz matrix, which gives distinct advantages when
solving (10.5), as we shall demonstrate shortly.

Clearly, when $N \gg n$, the difference between the two approaches becomes
insignificant.
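
The difference between the two approaches can be illustrated by a small sketch in Python with NumPy (an assumption, as are the helper names `regressors` and `is_toeplitz` and the random test signal). It treats the scalar AR case $z(t) = -y(t)$, forms R(N) with both windowing conventions, and checks which of the two matrices is Toeplitz.

```python
import numpy as np


def regressors(y, n, method):
    """Rows phi^T(t) = [z(t-1) ... z(t-n)] with z(t) = -y(t) (scalar AR case)."""
    N = len(y)
    if method == "covariance":
        # Approach 1: use only rows that involve measured data.
        z, start, stop = -y, n, N
    elif method == "autocorrelation":
        # Approach 2: pre- and postwindowing with n zeros at each end.
        z = -np.concatenate([np.zeros(n), y, np.zeros(n)])
        start, stop = n, N + 2 * n
    else:
        raise ValueError(method)
    return np.array([z[t - n:t][::-1] for t in range(start, stop)])


def is_toeplitz(R, tol=1e-10):
    """True if every diagonal of R is constant (up to rounding)."""
    n = R.shape[0]
    return all(np.ptp(np.diag(R, k)) < tol for k in range(-n + 1, n))


rng = np.random.default_rng(1)
y = rng.standard_normal(500)       # hypothetical data record
n = 3
Phi_cov = regressors(y, n, "covariance")
Phi_acf = regressors(y, n, "autocorrelation")
R_cov = Phi_cov.T @ Phi_cov / len(y)
R_acf = Phi_acf.T @ Phi_acf / len(y)
```

With N = 500 and n = 3 the two matrices differ only by a few edge terms, in line with the remark that the difference becomes insignificant when N is much larger than n.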


Levinson Algorithm (*) 


The shift structure (10.12) gives a specific structure for the matrix R(N). There is
an extensive literature on fast algorithms that utilize such structures. The simplest,
but generic, example of these methods is the Levinson algorithm (Levinson, 1947),
which we shall now derive.

Consider the case of an AR model of a signal
$$\hat y_n(t|\theta) = -a_1^n y(t-1) - \cdots - a_n^n y(t-n) \tag{10.15}$$
(the upper index n indicates that we are fitting an nth-order model). This corresponds
to a linear regression with $\varphi(t)$ subject to (10.12) with $z(t) = -y(t)$. If we apply the
autocorrelation method, we should solve (10.5); that is,
$$\begin{bmatrix}R_0 & R_1 & \cdots & R_{n-1}\\ R_1 & R_0 & \cdots & R_{n-2}\\ \vdots & \vdots & & \vdots\\ R_{n-1} & R_{n-2} & \cdots & R_0\end{bmatrix}\begin{bmatrix}a_1^n\\ a_2^n\\ \vdots\\ a_n^n\end{bmatrix} = \begin{bmatrix}-R_1\\ -R_2\\ \vdots\\ -R_n\end{bmatrix} \tag{10.16}$$
for $a_i^n$. Here
$$R_\tau = \frac{1}{N}\sum_{t=\tau}^{N}y(t-\tau)y(t), \quad \tau \ge 0 \tag{10.17}$$
and we have dropped the argument N. Equation (10.16) can be rewritten as


$$\begin{bmatrix}R_0 & R_1 & \cdots & R_n\\ R_1 & R_0 & \cdots & R_{n-1}\\ \vdots & \vdots & & \vdots\\ R_n & R_{n-1} & \cdots & R_0\end{bmatrix}\begin{bmatrix}1\\ a_1^n\\ \vdots\\ a_n^n\end{bmatrix} = \begin{bmatrix}V_n\\ 0\\ \vdots\\ 0\end{bmatrix} \tag{10.18}$$
Here the n last rows are identical to (10.16), while the first row is a definition of $V_n$.

Suppose now that we have solved (10.18) for $a_i^n$ and seek a solution for a
higher-order model (10.15) with order n + 1. The estimates $a_i^{n+1}$ will then be defined
analogously to (10.18). To find these, we first note that
$$\begin{bmatrix}R_0 & R_1 & \cdots & R_n & R_{n+1}\\ R_1 & R_0 & \cdots & R_{n-1} & R_n\\ \vdots & \vdots & & \vdots & \vdots\\ R_n & R_{n-1} & \cdots & R_0 & R_1\\ R_{n+1} & R_n & \cdots & R_1 & R_0\end{bmatrix}\begin{bmatrix}1\\ a_1^n\\ \vdots\\ a_n^n\\ 0\end{bmatrix} = \begin{bmatrix}V_n\\ 0\\ \vdots\\ 0\\ \alpha_n\end{bmatrix} \tag{10.19}$$
Here the first n + 1 rows are identical to (10.18), while the last row is a definition
of $\alpha_n$. The definition of $a_i^{n+1}$ looks quite like (10.19), the only difference being that
all but the first row of the right side should be zero. We thus seek to remove $\alpha_n$. A
moment's reflection on (10.19) shows that it can also be written as
$$\begin{bmatrix}R_0 & R_1 & \cdots & R_n & R_{n+1}\\ R_1 & R_0 & \cdots & R_{n-1} & R_n\\ \vdots & \vdots & & \vdots & \vdots\\ R_n & R_{n-1} & \cdots & R_0 & R_1\\ R_{n+1} & R_n & \cdots & R_1 & R_0\end{bmatrix}\begin{bmatrix}0\\ a_n^n\\ \vdots\\ a_1^n\\ 1\end{bmatrix} = \begin{bmatrix}\alpha_n\\ 0\\ \vdots\\ 0\\ V_n\end{bmatrix} \tag{10.20}$$
since the coefficient matrix is a symmetric Toeplitz matrix. We can also view the last
n + 1 rows of (10.20) as the normal equations for the regression
$$\hat y(t-n-1|\theta) = -a_1^n y(t-n) - a_2^n y(t-n+1) - \cdots - a_n^n y(t-1) \tag{10.21}$$
This is a reversed time model for the signal y(t). Since the second order properties
of a scalar stationary signal are symmetric with respect to the direction of time, the
coefficients in (10.21) coincide with those of (10.15). This is a signal theoretic reason
for the equality between (10.19) and (10.20). See also the Remark at the end of this
subsection.

Now multiply (10.20) by $\rho_n = -\alpha_n/V_n$ and add it to (10.19). This gives


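$$\begin{bmatrix}R_0 & R_1 & \cdots & R_n & R_{n+1}\\ R_1 & R_0 & \cdots & R_{n-1} & R_n\\ \vdots & \vdots & & \vdots & \vdots\\ R_n & R_{n-1} & \cdots & R_0 & R_1\\ R_{n+1} & R_n & \cdots & R_1 & R_0\end{bmatrix}\begin{bmatrix}1\\ a_1^n + \rho_na_n^n\\ \vdots\\ a_n^n + \rho_na_1^n\\ \rho_n\end{bmatrix} = \begin{bmatrix}V_n + \rho_n\alpha_n\\ 0\\ \vdots\\ 0\\ 0\end{bmatrix} \tag{10.22}$$
This is the defining relationship for $\hat a_i^{n+1}$. Hence
$$\begin{aligned}
\hat a_k^{n+1} &= \hat a_k^n + \rho_n\hat a_{n+1-k}^n, \quad k = 1, \ldots, n\\
\hat a_{n+1}^{n+1} &= \rho_n\\
\rho_n &= -\alpha_n/V_n\\
V_{n+1} &= V_n + \rho_n\alpha_n = V_n - \alpha_n^2/V_n\\
\alpha_n &= R_{n+1} + \sum_{k=1}^{n}\hat a_k^nR_{n+1-k}
\end{aligned} \tag{10.23}$$
(The hat here indicates the actual estimate, based on N data, as opposed to the
general model parameters $a_k$.) This expression allows us to easily compute $\hat a_k^{n+1}$
from $\hat a_k^n$. With the initial conditions
$$V_1 = R_0 - \frac{R_1^2}{R_0}, \qquad \hat a_1^1 = -\frac{R_1}{R_0} \tag{10.24}$$
we have a scheme for computing estimates of arbitrary orders. We note that going
from $\hat a^n$ to $\hat a^{n+1}$ in (10.23) requires 4n + 2 additions and multiplications and one
division. The computation of $\hat a_i^n$ thus requires proportional to $2n^2$ operations, which
is of an order of magnitude (in n) less than the general procedures (10.8) to (10.11).
Hence the term “fast algorithms.”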

The Levinson algorithm (10.23) has been widely applied and extended to the
case of vector-valued z, as well as to “the covariance method.” See, for example,
Whittle (1963), Wiggins and Robinson (1965), and Morf et al. (1977).

Remark. The most important change when dealing with vector-valued z is
that the corresponding reversed-time model (10.21)
$$\hat z(t-n-1|\theta) = -b_1^n z(t-n) - b_2^n z(t-n+1) - \cdots - b_n^n z(t-1) \tag{10.25}$$
gives $b_i$ estimates that differ from $a_i$. The scheme (10.23) must then be complemented
with an analogous scheme for updating the $b_i$. See Problem 10G.1.
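
For the scalar case, the order recursion (10.23) with the initial conditions (10.24) can be sketched as follows in Python with NumPy (an assumption; the function names `autocorr` and `levinson` and the simulated AR(2) signal are hypothetical). The result is compared with a direct solution of the Toeplitz system (10.16).

```python
import numpy as np


def autocorr(y, max_lag):
    """R_tau = (1/N) sum_t y(t - tau) y(t), data outside the record taken as zero."""
    N = len(y)
    return np.array([np.dot(y[:N - tau], y[tau:]) / N for tau in range(max_lag + 1)])


def levinson(R, n):
    """Return (a_1..a_n, V_n) for the order-n AR model from R_0..R_n."""
    a = np.array([-R[1] / R[0]])              # initial conditions (10.24)
    V = R[0] - R[1] ** 2 / R[0]
    for m in range(1, n):                     # order update m -> m + 1
        # alpha_m = R_{m+1} + sum_k a_k R_{m+1-k}
        alpha = R[m + 1] + np.dot(a, R[m:0:-1])
        rho = -alpha / V
        a = np.append(a + rho * a[::-1], rho)
        V = V + rho * alpha
    return a, V


# Hypothetical data: a simulated, stable AR(2) signal.
rng = np.random.default_rng(2)
e = rng.standard_normal(2000)
y = np.zeros(2000)
for t in range(2, 2000):
    y[t] = 1.2 * y[t - 1] - 0.5 * y[t - 2] + e[t]

n = 4
R = autocorr(y, n)
a_lev, V_lev = levinson(R, n)

# Reference: solve the Toeplitz normal equations (10.16) directly.
idx = np.arange(n)
T = R[np.abs(idx[:, None] - idx[None, :])]
a_ref = np.linalg.solve(T, -R[1:n + 1])
```

Each pass through the loop costs a number of operations proportional to the current order, which is where the quadratic (rather than cubic) total operation count comes from.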


Lattice Filters (*) 


Consider the predictors (10.15) for orders n and n + 1 evaluated at $\theta = \hat\theta = \hat\theta_N$:
$$\begin{aligned}
\hat y_n(t|\hat\theta^n) &= -\hat a_1^n y(t-1) - \cdots - \hat a_n^n y(t-n)\\
\hat y_{n+1}(t|\hat\theta^{n+1}) &= -\hat a_1^{n+1} y(t-1) - \cdots - \hat a_n^{n+1} y(t-n) - \hat a_{n+1}^{n+1} y(t-n-1)
\end{aligned} \tag{10.26}$$
Subtracting these expressions from each other gives, using (10.23),
$$\hat y_{n+1}(t|\hat\theta^{n+1}) = \hat y_n(t|\hat\theta^n) - \rho_nr_n(t-1) \tag{10.27}$$
where
$$r_n(t-1) = y(t-n-1) + \hat a_1^n y(t-n) + \cdots + \hat a_n^n y(t-1) \tag{10.28}$$
We recognize in (10.28) the error in the reversed-time predictor (10.21). Let us, with
the definition (10.28), consider
$$r_{n+1}(t) = y(t-n-1) + \hat a_1^{n+1} y(t-n) + \cdots + \hat a_n^{n+1} y(t-1) + \hat a_{n+1}^{n+1} y(t)$$
Subtracting (10.28) from this gives, again using (10.23),
$$r_{n+1}(t) - r_n(t-1) = \rho_n\left[y(t) + \hat a_1^n y(t-1) + \cdots + \hat a_n^n y(t-n)\right] \tag{10.29}$$
With the prediction error
$$\varepsilon_n(t) = y(t) - \hat y_n(t|\hat\theta^n)$$
the expressions (10.27) to (10.29) can be summarized as
$$\varepsilon_{n+1}(t) = \varepsilon_n(t) + \rho_nr_n(t-1) \tag{10.30a}$$
$$r_{n+1}(t) = r_n(t-1) + \rho_n\varepsilon_n(t) \tag{10.30b}$$
$$\varepsilon_0(t) = r_0(t) = y(t) \tag{10.30c}$$
This simple representation of the prediction errors [and the predictions $\hat y_n(t|\hat\theta^n) = y(t) - \varepsilon_n(t)$]
can be graphically represented as in Figure 10.1. Because of the structure
of this representation, (10.30) is usually called a lattice filter (sometimes a ladder
filter).

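An important feature of the representation (10.30) is that the variables $\varepsilon_n$ and
$r_n$ obey the following orthogonality relationships:
$$\frac{1}{N}\sum_{t=1}^{N}\varepsilon_n(t)\varepsilon_{n-k}(t-k) = \begin{cases}V_n, & \text{if } k = 0\\ 0, & \text{if } k \ne 0\end{cases}$$
$$\frac{1}{N}\sum_{t=1}^{N}r_n(t)r_k(t) = \begin{cases}V_n, & \text{if } n = k\\ 0, & \text{if } n \ne k\end{cases} \tag{10.31}$$
$$\frac{1}{N}\sum_{t=1}^{N}\varepsilon_n(t)r_k(t-1) = \begin{cases}\alpha_n, & \text{if } n = k\\ 0, & \text{if } n > k\end{cases}$$
(see Problem 10D.1). Hence the reflection coefficients $\rho_n$ can easily be computed as
$$\rho_n = -\frac{\sum_{t=1}^{N}\varepsilon_n(t)r_n(t-1)}{\sum_{t=1}^{N}r_n^2(t-1)} \tag{10.32}$$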


The scheme (10.30) together with (10.32) also forms an efficient way of
estimating the reflection coefficients $\rho_n$, as well as the predictions, as an alternative
to the Levinson algorithm. An important aspect is that the scheme produces all
lower-order predictors as a by-product. Lattice filters have been used extensively in
signal-processing applications. See, for example, Makhoul (1977), Griffiths (1977),
and Lee, Morf, and Friedlander (1981). See also Section 11.7 for recursive versions.
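
A minimal sketch of the lattice scheme in Python with NumPy follows (an assumption, as are the function name `lattice` and the test signal). It runs the recursions (10.30a) to (10.30c) and estimates each reflection coefficient as minus the ratio of the sample correlation between $\varepsilon_n(t)$ and $r_n(t-1)$ to the sample power of $r_n(t-1)$. The sketch assumes pre- and postwindowed data (the autocorrelation method), so that the sums run over the zero-padded record.

```python
import numpy as np


def lattice(y, n_max):
    """Lattice recursions with reflection coefficients estimated stage by stage."""
    # Pre- and postwindowing: the record is treated as zero outside 1..N, and
    # n_max trailing zeros leave room for the delayed backward errors.
    eps = np.concatenate([y, np.zeros(n_max)])   # eps_0(t) = y(t)
    r = eps.copy()                               # r_0(t)   = y(t)
    rhos = []
    for _ in range(n_max):
        r_del = np.concatenate([[0.0], r[:-1]])  # r_n(t - 1)
        rho = -np.dot(eps, r_del) / np.dot(r_del, r_del)
        # Forward and backward error updates, using the old eps on the right.
        eps, r = eps + rho * r_del, r_del + rho * eps
        rhos.append(rho)
    return np.array(rhos), eps, r


rng = np.random.default_rng(3)
y = 0.1 * rng.standard_normal(300).cumsum() + rng.standard_normal(300)
rhos, eps_n, r_n = lattice(y, n_max=3)
```

After stage n the array `eps` holds the prediction errors of the order-n model, so all lower-order predictors are available along the way, as noted above.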


Figure 10.1 A lattice filter representation.


10.2 NUMERICAL SOLUTION BY ITERATIVE SEARCH METHODS


In general, the function
$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N}\ell(\varepsilon(t,\theta), \theta) \tag{10.33}$$
cannot be minimized by analytical methods. Neither can the equation
$$0 = f_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N}\zeta(t,\theta)\alpha(\varepsilon(t,\theta)) \tag{10.34}$$
be solved by direct means in general. The solution then has to be found by iterative,
numerical techniques. There is an extensive literature on such numerical problems.
See, for example, Luenberger (1973), Bertsekas (1982), or Dennis and Schnabel
(1983) for general treatments.


Numerical Minimization 


Methods for numerical minimization of a function $V(\theta)$ update the estimate of the
minimizing point iteratively. This is usually done according to
$$\hat\theta^{(i+1)} = \hat\theta^{(i)} + \alpha f^{(i)} \tag{10.35}$$
where $f^{(i)}$ is a search direction based on information about $V(\theta)$ acquired at previous
iterations, and $\alpha$ is a positive constant determined so that an appropriate decrease
in the value of $V(\theta)$ is obtained. Depending on the information supplied by the user
to determine $f^{(i)}$, numerical minimization methods can be divided into three groups:

1. Methods using function values only.

2. Methods using values of the function V as well as of its gradient.

3. Methods using values of the function, of its gradient, and of its Hessian (the
second derivative matrix).

The typical member of group 3 corresponds to Newton algorithms, where the correction
in (10.35) is chosen in the “Newton” direction:
$$f^{(i)} = -\left[V''(\hat\theta^{(i)})\right]^{-1}V'(\hat\theta^{(i)}) \tag{10.36}$$


The most important subclass of group 2 consists of quasi-Newton methods,
which somehow form an estimate of the Hessian and then use (10.36). Algorithms
of group 1 either form gradient estimates by difference approximations and proceed
as quasi-Newton methods or have other specific search patterns. See Powell (1964).

Many standard programs implementing these ideas are available. The easiest
way for the identification user could be to supply such a program with necessary
information and leave the search for the minimum to the program. In any case, it
will be necessary to compute the function values of (10.33) for any required value
of $\theta$. The major burden for this lies in the calculation of the sequence of prediction
errors $\varepsilon(t, \theta)$, t = 1, ..., N. This itself could be a simple or complicated task.
Compare, for example, the model structures of Section 4.2 with the one in Example 4.1.

The gradient of (10.33) is
$$V'_N(\theta, Z^N) = -\frac{1}{N}\sum_{t=1}^{N}\left\{\psi(t,\theta)\ell'_\varepsilon(\varepsilon(t,\theta), \theta) - \ell'_\theta(\varepsilon(t,\theta), \theta)\right\} \tag{10.37}$$
Here, as usual, $\psi(t,\theta)$ is the d × p gradient matrix of $\hat y(t|\theta)$ (p = dim y) with
respect to $\theta$. The major computational burden in (10.37) lies in the calculation of
the sequence $\psi(t,\theta)$, t = 1, 2, ..., N. We discuss in Section 10.3 how this gradient
is computed for some common model structures. However, for some models, direct
calculation of $\psi$ could be forbidding, and then one has to resort to the minimization
methods of group 1 or to form estimates of $\psi$ by difference approximations.


Some Explicit Search Schemes 


Consider the special case of scalar output and quadratic criterion
$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N}\frac{1}{2}\varepsilon^2(t,\theta) \tag{10.38}$$
This problem is known as “the nonlinear least-squares problem” in numerical analysis.
An excellent and authoritative account of this problem is given in Chapter 10
of Dennis and Schnabel (1983). The criterion (10.38) has the gradient
$$V'_N(\theta, Z^N) = -\frac{1}{N}\sum_{t=1}^{N}\psi(t,\theta)\varepsilon(t,\theta) \tag{10.39}$$
A general family of search routines is then given by
$$\hat\theta_N^{(i+1)} = \hat\theta_N^{(i)} - \mu_N^{(i)}\left[R_N^{(i)}\right]^{-1}V'_N(\hat\theta_N^{(i)}, Z^N) \tag{10.40}$$
where $\hat\theta_N^{(i)}$ denotes the ith iterate, $R_N^{(i)}$ is a d × d matrix that modifies the search
direction (it will be discussed later), and the step size $\mu_N^{(i)}$ is chosen so that
$$V_N(\hat\theta_N^{(i+1)}, Z^N) < V_N(\hat\theta_N^{(i)}, Z^N) \tag{10.41}$$
We should also keep in mind that the minimization problem normally is a constrained
one: $\theta \in D_{\mathcal{M}}$. Often, however, $\partial D_{\mathcal{M}}$, the boundary of $D_{\mathcal{M}}$, corresponds to the
stability boundary of the predictor (cf. Definition 4.1) so that $V_N(\theta, Z^N)$ increases
rapidly as $\theta$ approaches $\partial D_{\mathcal{M}}$. Then the constraint can easily be obeyed by proper
selection of $\mu$.

The simplest choice of $R_N^{(i)}$ is to take it as the identity matrix,
$$R_N^{(i)} = I \tag{10.42}$$


which makes (10.40) the gradient or steepest-descent method. This method is fairly
inefficient close to the minimum. Newton methods typically perform much better
there. For (10.38), the Hessian is
$$V''_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N}\psi(t,\theta)\psi^T(t,\theta) - \frac{1}{N}\sum_{t=1}^{N}\psi'(t,\theta)\varepsilon(t,\theta) \tag{10.43}$$
where $\psi'(t,\theta)$ is the d × d Hessian of $\varepsilon(t,\theta)$.
Choosing
$$R_N^{(i)} = V''_N(\hat\theta_N^{(i)}, Z^N) \tag{10.44}$$
makes (10.40) a Newton method. It may, however, be quite costly to compute all
the terms of $\psi'$. Suppose now that there is a value $\theta_0$ such that the prediction errors
$\varepsilon(t, \theta_0) = e_0(t)$ are independent. Then this value yields the global minimum of
$EV_N(\theta, Z^N)$. Close to $\theta_0$ the second sum of (10.43) will then be close to zero since
$E\psi'(t, \theta_0)e_0(t) = 0$. We thus have
$$V''_N(\theta, Z^N) \approx \frac{1}{N}\sum_{t=1}^{N}\psi(t,\theta)\psi^T(t,\theta) \triangleq H_N(\theta) \tag{10.45}$$


If we apply a Newton method to the minimization problem, we need a good estimate
of the Hessian only in the vicinity of the minimum. The reason for this is that
Newton methods are designed to give one-step convergence for quadratic functions.
When the function values between the current iterate and the minimum cannot be
approximated very well by a quadratic function, the effect of the Hessian in (10.36)
is not so important. Moreover, by omitting the last sum in (10.43) the estimate of
the Hessian is always assured to be positive semidefinite. This makes the numerical
procedure a descent algorithm and guarantees convergence to a stationary point.
The conclusion is consequently that
$$R_N^{(i)} = H_N(\hat\theta_N^{(i)}) \tag{10.46}$$


is a quite suitable choice for our problem. This is also known as the Gauss-Newton
method. In the statistical literature the technique is called “the method of scoring”
(Rao, 1973). In the control literature the terms modified Newton-Raphson and
quasi-linearization have also been used. Dennis and Schnabel (1983) reserve the
term Gauss-Newton for (10.40) and (10.46) with the particular choice $\mu_N^{(i)} = 1$, and
suggest the term damped Gauss-Newton when an adjusted step size $\mu$ is applied.
Even though the expression (10.45) is assured to be positive semidefinite, it may
be singular or close to singular. This is the case, for example, if the model is
overparametrized or the data not informative enough. Then some numerical problems
arise in (10.40). Various ways to overcome this problem exist and are known as
regularization techniques. One common way is the Levenberg-Marquardt procedure
(Levenberg, 1944; Marquardt, 1963). Then an approximation
$$R_N^{(i)}(\lambda) = \frac{1}{N}\sum_{t=1}^{N}\psi(t, \hat\theta_N^{(i)})\psi^T(t, \hat\theta_N^{(i)}) + \lambda I \tag{10.47}$$
is used for the Hessian. Here $\lambda$ is a positive scalar that is used to control the
convergence in the iterative scheme rather than the step size parameter. We thus have
$\mu_N^{(i)} = 1$ in (10.40). With $\lambda = 0$ we have the Gauss-Newton case. Increasing $\lambda$
means that the step size is decreased and the search direction is turned towards the
gradient. Several schemes for manipulating $\lambda$ based on the test (10.41) have been
suggested; see, e.g., Scales (1985).
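
The search schemes above can be sketched in a few lines of Python with NumPy (an assumption; the function `search`, the exponential predictor `exp_predictor`, and the data are hypothetical). The sketch combines two ideas from the text in one routine: the Gauss-Newton direction with the step size halved until the criterion decreases (damped Gauss-Newton), and an optional fixed term $\lambda I$ added to the Hessian approximation. Note that keeping a step-halving loop together with $\lambda > 0$ is a simplification of this sketch, not the Levenberg-Marquardt procedure as such, which adjusts $\lambda$ instead of the step size.

```python
import numpy as np


def search(theta0, predictor, y, lam=0.0, max_iter=200, tol=1e-12):
    """Iterative search with R = H_N + lam * I and step halving."""
    theta = np.asarray(theta0, dtype=float)
    N = len(y)

    def loss(th):
        eps = y - predictor(th)[0]
        return 0.5 * np.mean(eps ** 2)           # quadratic criterion

    for _ in range(max_iter):
        yhat, psi = predictor(theta)             # psi has shape (N, d)
        eps = y - yhat
        grad = -psi.T @ eps / N                  # gradient of the criterion
        R = psi.T @ psi / N + lam * np.eye(theta.size)
        direction = np.linalg.solve(R, grad)
        V_old, mu = loss(theta), 1.0
        while mu > 1e-10 and not loss(theta - mu * direction) < V_old:
            mu /= 2.0                            # halve until V decreases
        if mu <= 1e-10:
            break                                # no decrease found; stop
        theta = theta - mu * direction
        if np.linalg.norm(mu * direction) < tol:
            break
    return theta


def exp_predictor(theta, t=np.linspace(0.0, 5.0, 50)):
    """Hypothetical predictor yhat = theta_1 * exp(-theta_2 * t) and its gradient."""
    decay = np.exp(-theta[1] * t)
    psi = np.column_stack([decay, -theta[0] * t * decay])
    return theta[0] * decay, psi


y = exp_predictor(np.array([2.0, 0.5]))[0]                  # noise-free data
theta_gn = search([1.0, 1.0], exp_predictor, y)             # damped Gauss-Newton
theta_lm = search([1.0, 1.0], exp_predictor, y, lam=0.1)    # with fixed lambda
```

With $\lambda > 0$ each step is shorter and turned toward the gradient direction, so the second call typically needs more iterations than the first to reach the same point.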

If the minimum does not give independent prediction errors, the second sum
of (10.43) then need not be negligible close to the minimum, and (10.45) need not
be a good approximation of the Hessian. A typical method then is to make use of
the known first sum of the Hessian and estimate the second sum with some secant
technique (see Dennis, Gay, and Welsch, 1981).


Correlation Equation 


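Solving the equation (10.34) is quite analogous to the minimization of (10.33) (see,
e.g., Dennis and Schnabel, 1983). Standard numerical procedures are the substitution
method [corresponding to (10.40) and (10.42)]:

$$ \hat\theta_N^{(i)} = \hat\theta_N^{(i-1)} - \mu_N^{(i)} f_N\bigl(\hat\theta_N^{(i-1)}, Z^N\bigr) \tag{10.48} $$

and the Newton-Raphson method [corresponding to (10.40) and (10.44)]:

$$ \hat\theta_N^{(i)} = \hat\theta_N^{(i-1)} - \mu_N^{(i)}\bigl[f_N'\bigl(\hat\theta_N^{(i-1)}, Z^N\bigr)\bigr]^{-1} f_N\bigl(\hat\theta_N^{(i-1)}, Z^N\bigr) \tag{10.49} $$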


10.3 COMPUTING GRADIENTS 


To use the formulas in the previous section, we need expressions for $\psi(t,\theta)$, the
gradient of the prediction. The amount of work required to compute $\psi(t,\theta)$ is
highly dependent on the model structure, and sometimes one may have to resort 
to numerical differentiation. In this section we shall provide expressions for some 
common model structures. 


Example 10.1 The ARMAX Model Structure 
Consider the ARMAX model (4.14). The predictor is given by (4.18):

$$ C(q)\hat y(t|\theta) = B(q)u(t) + [C(q) - A(q)]\,y(t) \tag{10.50} $$

Differentiating this expression with respect to $a_k$ gives

$$ C(q)\frac{\partial}{\partial a_k}\hat y(t|\theta) = -q^{-k}y(t) \tag{10.51} $$

Similarly,

$$ C(q)\frac{\partial}{\partial b_k}\hat y(t|\theta) = q^{-k}u(t) $$

and

$$ q^{-k}\hat y(t|\theta) + C(q)\frac{\partial}{\partial c_k}\hat y(t|\theta) = q^{-k}y(t) $$

With the vector $\varphi(t,\theta)$ defined by (4.20), these expressions can be rewritten
conveniently as

$$ C(q)\psi(t,\theta) = \varphi(t,\theta) \tag{10.52} $$

The gradient is thus obtained by filtering the "regression vector" $\varphi(t,\theta)$ through
the filter $1/C(q)$. This filter is stable for all $\theta$ for which the predictor (10.50) is
stable. □
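
The filtering in (10.52) can be sketched for a first-order ARMAX model as follows. This is a minimal illustration under assumed zero initial conditions; the function name `armax1_gradient` and the random test signals are hypothetical and only serve to show how the prediction and its gradient are computed in the same recursion.

```python
import numpy as np

def armax1_gradient(theta, y, u):
    """Predictor and its gradient for a first-order ARMAX model (illustrative).

    Model: y(t) + a y(t-1) = b u(t-1) + e(t) + c e(t-1), theta = (a, b, c).
    All signals are taken as zero before t = 0 (an assumed initialization).
    """
    a, b, c = theta
    N = len(y)
    yhat = np.zeros(N)
    eps = np.zeros(N)
    psi = np.zeros((N, 3))
    eps[0] = y[0]                       # yhat[0] = 0 by the zero initial conditions
    for t in range(1, N):
        # (10.50): C(q) yhat = B(q) u + [C(q) - A(q)] y
        yhat[t] = -c * yhat[t - 1] + b * u[t - 1] + (c - a) * y[t - 1]
        eps[t] = y[t] - yhat[t]
        phi = np.array([-y[t - 1], u[t - 1], eps[t - 1]])   # regression vector
        psi[t] = -c * psi[t - 1] + phi                      # (10.52): C(q) psi = phi
    return yhat, psi

rng = np.random.default_rng(1)
u = rng.standard_normal(50)
y = rng.standard_normal(50)
theta = np.array([-0.7, 1.0, 0.5])
yhat, psi = armax1_gradient(theta, y, u)
```

A finite-difference comparison of `yhat` with respect to each parameter is a convenient check that the filtered regression vector indeed equals the gradient.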


SISO Black-box Model (*) 


Most formulas for SISO black-box models will be contained in a treatment of the
general model (4.33). The predictor for this model is given by (4.35):

$$ \hat y(t|\theta) = \frac{D(q)B(q)}{C(q)F(q)}u(t) + \left[1 - \frac{D(q)A(q)}{C(q)}\right]y(t) \tag{10.53} $$

from which we find, as in Example 10.1, that

$$ \frac{\partial}{\partial a_k}\hat y(t|\theta) = -\frac{D(q)}{C(q)}y(t-k) \tag{10.54a} $$

$$ \frac{\partial}{\partial b_k}\hat y(t|\theta) = \frac{D(q)}{C(q)F(q)}u(t-k) \tag{10.54b} $$

$$ \frac{\partial}{\partial c_k}\hat y(t|\theta) = -\frac{D(q)B(q)}{C(q)C(q)F(q)}u(t-k) + \frac{D(q)A(q)}{C(q)C(q)}y(t-k) = \frac{1}{C(q)}\varepsilon(t-k,\theta) \tag{10.54c} $$

$$ \frac{\partial}{\partial d_k}\hat y(t|\theta) = \frac{B(q)}{C(q)F(q)}u(t-k) - \frac{A(q)}{C(q)}y(t-k) = -\frac{1}{C(q)}v(t-k,\theta) \tag{10.54d} $$

$$ \frac{\partial}{\partial f_k}\hat y(t|\theta) = -\frac{D(q)B(q)}{C(q)F(q)F(q)}u(t-k) = -\frac{D(q)}{C(q)F(q)}w(t-k,\theta) \tag{10.54e} $$

where we used $\varepsilon$, $v$, and $w$ as defined by (4.37) to (4.39). The gradient $\psi(t,\theta)$ is thus
also in this case obtained by filtering the regression vector $\varphi(t,\theta)$ [defined in (4.40)]
through linear filters, although different parts of $\varphi$ are associated with different filters
in the general case. It is clear that the filters involved here are all stable for $\theta \in D_{\mathcal M}$,
defined in Lemma 4.1, which are also the $\theta$ for which the predictors are stable.

In the special case of an output error model, $A(q) = C(q) = D(q) = 1$ [see
also (4.25)], we obtain from (10.54b, e)

$$ F(q)\psi(t,\theta) = \varphi(t,\theta) \tag{10.55} $$


General Finite-dimensional Linear Time-invariant Models (*) 


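A linear time-invariant finite-dimensional model can always be represented as

$$ \begin{aligned} \varphi(t+1,\theta) &= \mathcal F(\theta)\varphi(t,\theta) + \mathcal G(\theta)\begin{bmatrix} y(t) \\ u(t) \end{bmatrix} \\ \hat y(t|\theta) &= \mathcal H(\theta)\varphi(t,\theta) \end{aligned} \tag{10.56} $$

with proper choices of the matrices $\mathcal F$, $\mathcal G$, and $\mathcal H$ and with $\dim\varphi = n$. This is true
for the general SISO model (4.35), for which $\varphi(t,\theta)$ can be chosen as (4.40), as well
as for the general state-space model (4.86), for which

$$ \mathcal F(\theta) = A(\theta) - K(\theta)C(\theta) \tag{10.57a} $$
$$ \mathcal G(\theta) = \begin{bmatrix} K(\theta) & B(\theta) \end{bmatrix} \tag{10.57b} $$
$$ \mathcal H(\theta) = C(\theta), \qquad \hat x(t,\theta) = \varphi(t,\theta) \tag{10.57c} $$

Stability of the predictor (10.56) requires $\theta$ to belong to

$$ D_{\mathcal M} = \{\theta \mid \mathcal F(\theta) \text{ has all eigenvalues inside the unit circle}\} \tag{10.58} $$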
The equation (10.56) can now be differentiated with respect to $\theta$. Introducing

$$ \xi(t,\theta) = \begin{bmatrix} \varphi^T(t,\theta) & \dfrac{\partial}{\partial\theta_1}\varphi^T(t,\theta) & \cdots & \dfrac{\partial}{\partial\theta_d}\varphi^T(t,\theta) \end{bmatrix}^T \tag{10.59} $$

we may, for some matrices $\mathcal A(\theta)$, $\mathcal B(\theta)$, and $\mathcal C(\theta)$, write

$$ \begin{aligned} \xi(t+1,\theta) &= \mathcal A(\theta)\xi(t,\theta) + \mathcal B(\theta)\begin{bmatrix} y(t) \\ u(t) \end{bmatrix} \\ \begin{bmatrix} \hat y(t|\theta) \\ \psi(t,\theta) \end{bmatrix} &= \mathcal C(\theta)\xi(t,\theta) \end{aligned} \tag{10.60} $$

It can readily be verified that the $(d+1)n \times (d+1)n$ matrix $\mathcal A(\theta)$ will contain
the matrix $\mathcal F(\theta)$ in each of its $d+1$ block-diagonal entries and has all zeros above
the block diagonal. Hence the stability properties of $\mathcal A(\theta)$ coincide with those of
$\mathcal F(\theta)$.

We will find (10.60) useful for developing general algorithms. Clearly, however,
these equations, as given, will not be suited for practical calculations of the gradient,
since the filter is of order $(d+1)n$. It has been shown, though, by Gupta and
Mehra (1974) that the dimension of the controllability subspace of (10.60) does not
exceed $4n$ regardless of the parametrization and regardless of the value of $d$. By
proper transformations, the necessary calculations involved in (10.60) can thus be
substantially reduced.

Finally, we note that when the methods of Section 10.2 are used, the expressions
of the present section have to be used between each iteration to compute $\psi(t,\hat\theta_N^{(i)})$.
This means that we have to run the data from $t = 1$ to $N$ through filters like (10.54)
or (10.60) for each iteration in (10.40).


Nonlinear Black-Box Models and Back-Propagation 


In connection with neural networks, the celebrated Back-Propagation (BP) algorithm
is used to compute the gradients of the predictor. Back-propagation has been
described in several contexts; see, e.g., Werbos (1974) and Rumelhart, Hinton, and
Cybenko (1986). Sometimes in the neural network literature the entire search algorithm
is called Back-Propagation. It is, however, more consistent to keep this term just
for the algorithm used to calculate the gradient.

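Consider the general nonlinear black-box structure (5.43):

$$ g(\varphi,\theta) = \sum_{k=1}^{n}\alpha_k\,\kappa\bigl(\beta_k(\varphi - \gamma_k)\bigr) \tag{10.61} $$

To find the gradient $\psi(t,\theta) = (d/d\theta)\,g(\varphi,\theta)$ for this one-hidden-layer network is
quite simple. Writing a generic term as $\alpha\kappa(\beta\varphi - \gamma)$ (that is, with the offset
$\beta_k\gamma_k$ of (10.61) renamed $\gamma$), we just need to compute

$$ \frac{d}{d\alpha}\,\alpha\kappa(\beta\varphi - \gamma) = \kappa(\beta\varphi - \gamma) $$

$$ \frac{d}{d\gamma}\,\alpha\kappa(\beta\varphi - \gamma) = -\alpha\kappa'(\beta\varphi - \gamma) $$

$$ \frac{d}{d\beta}\,\alpha\kappa(\beta\varphi - \gamma) = \alpha\kappa'(\beta\varphi - \gamma)\,\varphi $$

The BP algorithm in this case means that the factor $\alpha\kappa'(\beta\varphi - \gamma)$ from the derivative
with respect to $\gamma$ is re-used in the calculation of the derivative with respect to $\beta$.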
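
The re-use of the common factor can be sketched in a few lines. The fragment below assumes a scalar regressor, $\kappa = \tanh$, and terms of the form $\alpha_k\kappa(\beta_k\varphi - \gamma_k)$; the function name `network_and_gradient` and the numerical values are illustrative assumptions only.

```python
import math

def network_and_gradient(phi, alpha, beta, gamma):
    """g = sum_k alpha_k * tanh(beta_k * phi - gamma_k) and its parameter gradient.

    Returns g and the lists dg/dalpha_k, dg/dbeta_k, dg/dgamma_k (scalar phi assumed).
    """
    g = 0.0
    d_alpha, d_beta, d_gamma = [], [], []
    for a, b, c in zip(alpha, beta, gamma):
        s = math.tanh(b * phi - c)
        common = a * (1.0 - s * s)     # alpha * kappa'(beta*phi - gamma), computed once
        g += a * s
        d_alpha.append(s)              # derivative w.r.t. alpha_k
        d_gamma.append(-common)        # derivative w.r.t. gamma_k
        d_beta.append(common * phi)    # same factor re-used for beta_k
    return g, d_alpha, d_beta, d_gamma

phi = 0.3
alpha, beta, gamma = [1.0, -2.0], [0.5, 1.5], [0.1, -0.4]
g, d_alpha, d_beta, d_gamma = network_and_gradient(phi, alpha, beta, gamma)
```

With a vector-valued regressor the same pattern applies, the only change being that the derivative with respect to $\beta_k$ becomes a vector.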
The Back-Propagation algorithm is, however, very general and not limited to
one-hidden-layer sigmoid neural network models. Instead, it applies to all network
models and it can be described as the chain rule for differentiation applied to the
expression (5.47) with a smart re-use of intermediate results which are needed at
several places in the algorithm. For ridge construction models (5.42) where $\beta_k$ is a
parameter vector, the only complicated thing with the algorithm is actually to keep
track of all indexes. When $\beta_k$ is a parameter matrix, as in the radial approach (5.40),
the calculation becomes somewhat more complicated, but the basic procedure
remains the same. See Saarinen, Bramley, and Cybenko (1993) for an illuminating
description of these general aspects.
When shifting to multi-layer network models, the possibilities of re-using
intermediate results increase and so does the importance of the BP algorithm.

For recurrent models the calculation of the gradient becomes more complicated.
The gradient $\psi(t,\theta)$ at one time instant does not only depend on the regressor
$\varphi(t,\theta)$ but also on the gradient at the previous time instant, $\psi(t-1,\theta)$. See Nerrand
et al. (1993) for a discussion on this topic. The additional problem of calculating the
gradient does, however, not change anything essential in the minimization algorithm.
In the neural network literature this is often referred to as Back-propagation through
time.


Recursive Techniques for Off-line Problems 


An idea to save work in the minimization procedure could be to combine the methods
of Sections 10.2 and 10.3 so as to modify the estimate $\hat\theta_N^{(i)}$ in (10.40) at the same
time as the prediction error gradients are computed in (10.60) [i.e., to "link the
index $(i)$ to $N$"]. We shall develop such recursive algorithms in Chapter 11 for
on-line applications. These are, however, quite useful also for off-line problems as
alternatives to (10.40). Then typically the data record is run through the recursive
algorithm a couple of times, and it can be shown that such a procedure will have the
same convergence properties as (10.40). See Ljung and Söderström (1983), Section
7.2, and Solbrand, Ahlén, and Ljung (1985) for further details.


10.4 TWO-STAGE AND MULTISTAGE METHODS 


The techniques described in Sections 10.2 and 10.3 should be regarded as the basic
numerical methods for parameter estimation. They have the advantages of guaranteed
convergence (to a local minimum), efficiency, and applicability to general
model structures. Nevertheless, the literature is abundant with alternative techniques,
mostly related to special cases of the general linear model structure (4.33):

$$ A(q)y(t) = \frac{B(q)}{F(q)}u(t) + \frac{C(q)}{D(q)}e(t) \tag{10.62} $$

(or multivariable counterparts), and to the general nonlinear black-box structure
(10.61). A basic idea is to rephrase the problem as a linear regression problem or
a sequence of such problems, so that the efficient methods of Section 10.1 can be
applied. For (10.61) it may involve fixing the parameters that enter nonlinearly (i.e.,
$\beta$ and $\gamma$) and estimating the $\alpha$'s by linear regression.

The algorithms typically involve two or several LS stages (or IV stages) applied
to different substructures, and we therefore call them two-stage or multistage
methods.

In this section we shall give a short description of the building blocks of such
procedures. Mixing techniques (IV, LS, PEM, PLR) and models (FIR, ARX, ARMAX,
etc.) into procedures involving several stages leads to a myriad of "identification
methods." There will be no need to list all these. They can, however, be understood
and analyzed by our techniques, applied to the different stages (see Problems 10G.2
and 10E.1). Our interest in this topic is twofold: it helps us to understand the
identification literature, and the techniques may be useful for providing initial estimates
for the basic schemes of Section 10.2.


The subspace method (7.66) can also be regarded as a two-stage method, being
built up from two LS-steps. Due to the rather complex nature of this algorithm, we
will treat it separately in Section 10.6. For the rich possibilities offered for nonlinear
black-box models, we refer to Sjöberg et al. (1995).


Bootstrap Methods 


Consider the correlation formulation (7.110): Solve

$$ f_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N}\zeta(t,\theta)\,\varepsilon(t,\theta) = 0 \tag{10.63a} $$

in the special case where the prediction error can be written

$$ \varepsilon(t,\theta) = y(t) - \varphi^T(t,\theta)\,\theta \tag{10.63b} $$

This formulation contains a number of common situations:

• IV methods with $\zeta(t,\theta)$ as in (7.127) and $\varphi(t,\theta) = \varphi(t)$ as in (7.114).
• PLR methods with $\zeta(t,\theta) = \varphi(t,\theta)$ as in (7.113).
• Minimizing the quadratic criterion (10.38) for models that can be written as
(7.112), taking $\zeta(t,\theta) = \psi(t,\theta)$.

With a nominal iterate $\hat\theta_N^{(i-1)}$ at hand, it is then natural to determine the next one by
solving

$$ \frac{1}{N}\sum_{t=1}^{N}\zeta(t,\hat\theta_N^{(i-1)})\bigl[y(t) - \varphi^T(t,\hat\theta_N^{(i-1)})\,\theta\bigr] = 0 $$

for $\theta$. This is a linear problem and can be solved as

$$ \hat\theta_N^{(i)} = \left[\frac{1}{N}\sum_{t=1}^{N}\zeta(t,\hat\theta_N^{(i-1)})\,\varphi^T(t,\hat\theta_N^{(i-1)})\right]^{-1}\frac{1}{N}\sum_{t=1}^{N}\zeta(t,\hat\theta_N^{(i-1)})\,y(t) \tag{10.64} $$

Solving (10.64) is essentially a least-squares problem (10.2) with proper definitions
of $R(N)$ and $f(N)$. The techniques described in Section 10.1 thus apply also to
(10.64).

The algorithm (10.64) is known as a bootstrap method, since it alternates between
computing $\theta$ and forming new vectors $\varphi$ and $\zeta$. It should be noted that it does
not necessarily converge to a solution of (10.63). A convergence analysis is given by
Stoica and Söderström (1981b) and Stoica et al. (1985).
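
A minimal sketch of the iteration (10.64) for the IV case is given below. It assumes a first-order output-error type system and instruments formed from the simulated, noise-free output of the current model; the function name `bootstrap_step`, the system, and the number of iterations are illustrative assumptions.

```python
import numpy as np

def bootstrap_step(zeta, phi, y):
    """Solve (10.64); zeta and phi are (N, d) arrays evaluated at the previous iterate."""
    N = len(y)
    R = zeta.T @ phi / N
    f = zeta.T @ y / N
    return np.linalg.solve(R, f)

# Assumed system: x(t) = 0.7 x(t-1) + u(t-1), y(t) = x(t) + measurement noise,
# that is, a = -0.7 and b = 1 in y(t) + a y(t-1) = b u(t-1) + noise.
rng = np.random.default_rng(0)
N = 5000
u = rng.standard_normal(N)
x = np.zeros(N)
for t in range(1, N):
    x[t] = 0.7 * x[t - 1] + u[t - 1]
y = x + 0.5 * rng.standard_normal(N)

phi = np.column_stack([-y[:-1], u[:-1]])     # regressors for y(1), ..., y(N-1)
Y = y[1:]
theta_ls = bootstrap_step(phi, phi, Y)       # zeta = phi gives the LS estimate
theta = theta_ls.copy()
for _ in range(3):
    a_hat, b_hat = theta
    xm = np.zeros(N)                         # noise-free output of the current model
    for t in range(1, N):
        xm[t] = -a_hat * xm[t - 1] + b_hat * u[t - 1]
    zeta = np.column_stack([-xm[:-1], u[:-1]])   # new instruments
    theta = bootstrap_step(zeta, phi, Y)         # new estimate
```

Each pass forms new instruments from the latest estimate and then solves a linear system, which is exactly the alternation that gives the method its name.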
Bilinear Parametrizations


For some model structures, the predictor is bilinear in the parameters. That is, the
parameter vector $\theta$ can be split up into two parts,

$$ \theta = \begin{bmatrix} \rho \\ \eta \end{bmatrix} $$

such that

$$ \hat y(t|\theta) = \hat y(t|\rho,\eta) \tag{10.65} $$

is linear in $\rho$ for fixed $\eta$ and linear in $\eta$ for fixed $\rho$. A typical such situation is the
ARARX structure (4.22):

$$ \hat y(t|\theta) = B(q)D(q)u(t) + [1 - A(q)D(q)]\,y(t) \tag{10.66} $$

Clearly, by associating $\rho$ with the A- and B-parameters and $\eta$ with the D-parameters,
the preceding bilinear situation is at hand.
With this situation, a natural way of minimizing

$$ V_N(\theta, Z^N) = V_N(\rho,\eta,Z^N) = \frac{1}{N}\sum_{t=1}^{N}\bigl(y(t) - \hat y(t|\rho,\eta)\bigr)^2 \tag{10.67} $$

would be to treat it as a sequence of least-squares problems. Let

$$ \hat\rho_N^{(i)} = \arg\min_{\rho} V_N\bigl(\rho, \hat\eta_N^{(i-1)}, Z^N\bigr) \tag{10.68a} $$

$$ \hat\eta_N^{(i)} = \arg\min_{\eta} V_N\bigl(\hat\rho_N^{(i)}, \eta, Z^N\bigr) \tag{10.68b} $$

Each of these problems is a pure least-squares problem and can be solved efficiently.
Although this procedure bears some resemblance to the bootstrap methods, it is
indeed a minimization method that will lead to a local minimum (cf. Problems 10T.3
and 10E.9).
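
The alternation (10.68) can be sketched for a first-order ARARX model, with $A = 1 + aq^{-1}$, $B = bq^{-1}$, and $D = 1 + dq^{-1}$. The function name `ararx1_alternating_ls`, the simulated system, and the fixed number of iterations are assumptions made for illustration.

```python
import numpy as np

def ararx1_alternating_ls(y, u, iterations=10):
    """Alternate the two LS problems (10.68a) and (10.68b) for a first-order ARARX model."""
    a = b = d = 0.0
    history = []
    for _ in range(iterations):
        # (10.68a): for fixed d, eps(t) = yF(t) + a yF(t-1) - b uF(t-1) is linear in (a, b),
        # where yF = D(q) y and uF = D(q) u are prefiltered data.
        yF = y[1:] + d * y[:-1]              # yF[k] is D(q) y at time k + 1
        uF = u[1:] + d * u[:-1]
        X = np.column_stack([-yF[:-1], uF[:-1]])
        a, b = np.linalg.lstsq(X, yF[1:], rcond=None)[0]
        # (10.68b): for fixed (a, b), eps(t) = v(t) + d v(t-1) is linear in d,
        # where v = A(q) y - B(q) u.
        v = y[1:] + a * y[:-1] - b * u[:-1]  # v[k] is v at time k + 1
        d = -(v[1:] @ v[:-1]) / (v[:-1] @ v[:-1])
        eps = v[1:] + d * v[:-1]
        history.append(np.mean(eps ** 2))    # criterion value after this iteration
    return (a, b, d), history

# Assumed data-generating system: a = -0.7, b = 1, d = 0.5.
rng = np.random.default_rng(0)
N = 2000
u = rng.standard_normal(N)
e = 0.3 * rng.standard_normal(N)
v = np.zeros(N)
y = np.zeros(N)
for t in range(1, N):
    v[t] = -0.5 * v[t - 1] + e[t]
    y[t] = 0.7 * y[t - 1] + u[t - 1] + v[t]
(a, b, d), history = ararx1_alternating_ls(y, u)
```

Because each stage minimizes the same criterion over one block of parameters, the recorded criterion values cannot increase from one iteration to the next.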


Separable Least Squares 


A more general situation than the bilinear case is when one set of parameters enters
linearly and another set nonlinearly in the predictor:

$$ \hat y(t|\theta,\eta) = \theta^T\varphi(t,\eta) \tag{10.69} $$

The identification criterion then becomes

$$ V_N(\theta,\eta,Z^N) = \sum_{t=1}^{N}\bigl(y(t) - \theta^T\varphi(t,\eta)\bigr)^2 = |Y - \Phi(\eta)\theta|^2 \tag{10.70} $$

where we introduced matrix notation, analogously to (10.7). For given $\eta$ this criterion
is an LS criterion and minimized w.r.t. $\theta$ by

$$ \hat\theta = \bigl[\Phi^T(\eta)\Phi(\eta)\bigr]^{-1}\Phi^T(\eta)\,Y \tag{10.71} $$

We can thus insert this into (10.70) and define the problem as

$$ \min_{\eta}\,|Y - P(\eta)Y|^2 = |(I - P(\eta))Y|^2, \qquad P(\eta) = \Phi(\eta)\bigl[\Phi^T(\eta)\Phi(\eta)\bigr]^{-1}\Phi^T(\eta) \tag{10.72} $$

The estimate $\hat\theta$ is then obtained by inserting the minimizing $\eta$ into (10.71). Note
that the matrix $P$ is a projection matrix: $P^2 = P$. The method is called separable
least squares since the LS-part has been separated out, and the problem reduced to a
minimization problem of lower dimension. See Golub and Pereyra (1973) for a thorough
treatment of this approach. It is known to give numerically well-conditioned
calculations, but does not necessarily give faster convergence than applying a damped
Gauss-Newton method to (10.70) without utilizing the particular structure.
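
A small sketch of the projected criterion (10.72) follows. It uses a hypothetical static toy model, $\hat y(t) = \theta_1 e^{-\eta t} + \theta_2$, noise-free data, and a crude grid search over the scalar $\eta$; these choices, and the function names, are assumptions for illustration rather than a recommended search strategy.

```python
import numpy as np

def regressors(eta, t):
    """Phi(eta) for the toy model yhat(t) = theta_1 * exp(-eta t) + theta_2."""
    return np.column_stack([np.exp(-eta * t), np.ones_like(t)])

def projected_criterion(eta, t, Y):
    """|(I - P(eta)) Y|^2 from (10.72), with the LS estimate (10.71) as a by-product."""
    Phi = regressors(eta, t)
    theta = np.linalg.lstsq(Phi, Y, rcond=None)[0]   # (10.71), solved without explicit inverse
    r = Y - Phi @ theta                              # (I - P(eta)) Y
    return r @ r, theta

t = np.linspace(0.0, 5.0, 100)
Y = 2.0 * np.exp(-1.3 * t) + 0.5                     # noise-free data with eta = 1.3
grid = np.linspace(0.1, 3.0, 2901)                   # one-dimensional search over eta only
values = [projected_criterion(eta, t, Y)[0] for eta in grid]
eta_hat = grid[int(np.argmin(values))]
theta_hat = projected_criterion(eta_hat, t, Y)[1]
```

The search runs over one parameter instead of three, which is the reduction in dimension that the separation provides.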


High-Order AR(X) Models 
Suppose the true system is given as

$$ y(t) = G_0(q)u(t) + H_0(q)e_0(t) $$

and an ARX structure

$$ A^M(q)y(t) = B^M(q)u(t) + e(t) $$

of order $M$ is used. Then it can be shown (e.g., Hannan and Kavalieris, 1984, and
Ljung and Wahlberg, 1992) that as the number of data $N$ tends to infinity, as well
as the model order $M$ ($N$ "faster than" $M$), the model $\hat A_N^M$, $\hat B_N^M$ will converge to the true
system in the following sense:

$$ \frac{\hat B_N^M(e^{i\omega})}{\hat A_N^M(e^{i\omega})} \to G_0(e^{i\omega}), \quad \text{uniformly in } \omega \text{ as } N \gg M \to \infty $$

$$ \frac{1}{\hat A_N^M(e^{i\omega})} \to H_0(e^{i\omega}), \quad \text{uniformly in } \omega \text{ as } N \gg M \to \infty $$

This means that a high-order ARX model is capable of approximating any linear
system arbitrarily well. It is of course desirable to reduce this high-order model to
more tractable versions within the structure (10.62), and for that purpose a number
of different possibilities are at hand:

1. Find $G = B/A$ as a rational structure by eliminating common factors in $\hat A_N^M$
and $\hat B_N^M$ (Söderström, 1975b).

2. Apply model reduction techniques based on balanced realizations to $\hat B_N^M/\hat A_N^M$
(Wahlberg, 1986; Zhu and Backx, 1993).

3. Let $z(t)$ be the output of the model $\hat B_N^M/\hat A_N^M$ driven by the actual input $u$ and
apply an ARX model to the input-output pair $(z, u)$ (Pandaya, 1974; Hsia,
1977, Chapter 7).

4. Let $\hat e_M(t)$ be the residuals associated with the model $\hat A_N^M$, $\hat B_N^M$. Use a model
structure

$$ A(q)y(t) = B(q)u(t) + [C(q) - 1]\,\hat e_M(t) + e(t) \tag{10.73} $$

to estimate $A$, $B$, and $C$. Since $\hat e_M(t)$ is a known sequence, this structure is an
ARX structure with two inputs, and the estimates are thus determined by the
LS method (Mayne and Firoozan, 1982).

5. The subspace method (7.66) (which is described in detail in Section 10.6) should
rightly be included in this family too: The $k$-step ahead predictors computed
from (7.62) are the high-order ARX models, while (7.60) corresponds to the
model reduction step, and (7.56) is the second LS-stage.
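
Possibility 4 in the list above can be sketched as two ordinary least-squares fits. The fragment assumes a first-order ARMAX system, a high-order ARX model of order $M = 20$, and simulated data; the helper names `lagged` and `two_stage_armax1` are hypothetical.

```python
import numpy as np

def lagged(x, k, start):
    """Return x(t - k) for t = start, ..., len(x) - 1."""
    return x[start - k: len(x) - k]

def two_stage_armax1(y, u, M=20):
    """High-order ARX fit followed by LS on (10.73) for a first-order ARMAX model."""
    N = len(y)
    # Stage 1: ARX model of order M, estimated by LS; its residuals estimate e(t).
    X1 = np.column_stack([-lagged(y, k, M) for k in range(1, M + 1)] +
                         [lagged(u, k, M) for k in range(1, M + 1)])
    th1 = np.linalg.lstsq(X1, y[M:], rcond=None)[0]
    e_hat = np.zeros(N)
    e_hat[M:] = y[M:] - X1 @ th1
    # Stage 2: (10.73) with the residuals as a second, known input:
    # y(t) = -a y(t-1) + b u(t-1) + c e_hat(t-1) + e(t)
    s = M + 1
    X2 = np.column_stack([-lagged(y, 1, s), lagged(u, 1, s), lagged(e_hat, 1, s)])
    a, b, c = np.linalg.lstsq(X2, y[s:], rcond=None)[0]
    return a, b, c

# Assumed system: y(t) - 0.7 y(t-1) = u(t-1) + e(t) + 0.5 e(t-1).
rng = np.random.default_rng(0)
N = 5000
u = rng.standard_normal(N)
e = rng.standard_normal(N)
y = np.zeros(N)
for t in range(1, N):
    y[t] = 0.7 * y[t - 1] + u[t - 1] + e[t] + 0.5 * e[t - 1]
a, b, c = two_stage_armax1(y, u)
```

No iterative search is involved; the price is that the quality of the second stage depends on how well the high-order model whitens the residuals.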


Separating Dynamics and Noise Models 


In the general linear model (10.62), we can always determine the dynamic part from
$u$ to $y$ using the IV method. Splitting the denominator estimate thus obtained into
one factor $A(q)$ that is supposed to be in common with the noise description and one
factor $F(q)$ that is particular to the dynamics (typically one would postulate one of
$A$ and $F$ to be unity), we can then determine

$$ \hat v(t) = \hat A(q)y(t) - \frac{\hat B(q)}{\hat F(q)}u(t) \tag{10.74} $$

as an estimate of the equation noise [cf. (4.38)]. This noise can then be regarded as
a measured signal, and an ARMA model

$$ v(t) = \frac{C(q)}{D(q)}e(t) \tag{10.75} $$

can be constructed as a separate step. Young has developed this technique in a
number of papers (see, e.g., Young and Jakeman, 1979).


Determining ARMA Models 


The parameters of the ARMA model (10.75) can of course be estimated using the 
prediction-error approach. Two alternatives that avoid iterative search procedures 
are as follows: 


1. Apply a high-order AR model to $v(t)$ in (10.75) to form estimates of the
innovations $\hat e(t)$. Then form the ARX model

$$ D(q)v(t) = [C(q) - 1]\,\hat e(t) + e(t) \tag{10.76} $$

with $v(t)$ as output and $\hat e(t)$ as input, and estimate $D$ and $C$ with the LS method
[cf. (10.73)].

2. Estimate the AR parameters $D(q)$ using the IV method as explained in Problem
7E.1. Then model $w(t) = \hat D(q)v(t)$ as an MA model. See Durbin
(1959) and Walker (1961) for related techniques, and Broersen (1997) for a
discussion of order selection.


10.5 LOCAL SOLUTIONS AND INITIAL VALUES


Local Minima 


The general numerical schemes for minimization and equation solution that we
discussed in Section 10.2 typically have the property that, with suitably chosen step
length $\mu$, they will converge to a solution of the posed problem. This means that
(10.48) and (10.49) will converge to a point $\theta_N^*$ such that

$$ f_N(\theta_N^*, Z^N) = 0 \tag{10.77} $$

while (10.40), with positive definite $R$, converges to a local minimum of $V_N(\theta, Z^N)$.

For the minimization problem, it is the global minimum that interests us.
The theoretical results of Chapters 8 and 9 dealt with properties of the globally
minimizing estimate $\hat\theta_N$. Similarly, the equation (10.77) may have several solutions.
It is obviously an inherent feature of the iterative search routines of Section 10.2 that
only convergence to a local solution of the problem can be guaranteed. To find the
global solution, there is usually no other way than to start the iterative minimization
routine at different feasible initial values and compare the results. An important
possibility is to use some preliminary estimation procedure to produce a good initial
value for the minimization. See the following discussion.

When validating the model, as we shall discuss in Sections 16.5 and 16.6, the
model is judged according to its performance, though. Therefore, local minima do
not necessarily create problems in practice. If a model passes the validation tests, it
should be an acceptable model, even if it does not give the global minimum of the
criterion function.

The problem of "false" local solutions has two aspects. Let us concentrate on
the problem of local minima. It may be that the limit of the criterion function as
$N$ tends to infinity, $\bar V(\theta)$, has such local minima. Then also $V_N(\theta, Z^N)$ will have
such minima for large $N$, according to Lemma 8.2. The existence of local minima of
$\bar V(\theta)$ can be analyzed, but only few results are available as yet. Some of them will
be given later. The other aspect is that, even if $\bar V(\theta)$ has only one local minimum
(= the global one), the function $V_N(\theta, Z^N)$ may have other local minima due to the
randomness in data. This is a much harder problem to treat analytically. The one
exception is the linear regression least-squares method, where by construction the
criterion function has no nonglobal local minima regardless of the properties of the
data.


Results for SISO Black-box Models 


The only analytical results available on local solutions are for black-box models under
the assumption that the system can be described within the model set: $\mathcal S \in \mathcal M$. We
list these results here with references for proofs. They all concern the general SISO
model set (10.62), and refer to

$$ \bar V(\theta) = \bar E\,\varepsilon^2(t,\theta) $$

For simplicity we call a nonglobal, local minimum a "false minimum."

• For ARMA models ($B = 0$, $D = F = 1$) all stationary points of $\bar V(\theta)$ are
global minima (Åström and Söderström, 1974).

• For ARARX models ($C = F = 1$) there are no false local minima if the
signal-to-noise ratio is large enough. If it is very small, false local minima do
exist (Söderström, 1974).

• If $A = 1$, there are no false local minima if $n_f = 1$ (Söderström, 1975c).

• If $A = C = D = 1$, there are no false local minima if the input is white noise.
For other inputs, however, false local minima can exist (Söderström, 1975c).

For the ARMAX model ($F = D = 1$), it is not known whether false local
minima exist. For the pseudolinear regression approach (7.113), it can, however, be
shown that

$$ \bar E\,\varphi(t,\theta)\,\varepsilon(t,\theta) = 0 \;\Rightarrow\; \theta = \theta_0 \tag{10.78} $$

in the case of an ARMAX model, with $\theta_0$ denoting the true parameters (Ljung,
Söderström, and Gustavsson, 1975).


The practical experience with different model structures is that the global min- 
imum is usually found without too much problem for ARMAX models. See, for 
example, Bohlin (1971) for a discussion of these points. For output error structures,
on the other hand, convergence to false local minima is not uncommon. 


Initial Parameter Values 


Due to the possible occurrence of undesired local minima in the criterion function, it
is worthwhile to spend some effort on producing good initial values for the iterative
search procedures. Also, since the Newton-type methods described in Section 10.2
have good local convergence rates, but not necessarily fast convergence far from the
minimum, these efforts usually pay off in fewer iterations and shorter total computing
time.

For a physically parametrized model structure, it is most natural to use our
physical insight to provide reasonable initial values. Also, it allows us to monitor
and interact with the iterative search scheme.

For a linear black-box model structure several possibilities exist. It is our 
experience that the following is a good start-up procedure for the general model 
structure (10.62): 


1. Apply the IV method to estimate the dynamic transfer function $B/AF$. Most
often one of $A$ and $F$ is unity. For a system that has operated in open loop,
first an LS estimate of an ARX model can be determined, to be used in the
generation of instruments as in (7.123).   (10.79a)

2. Determine an estimate of the equation noise as in (10.74).   (10.79b)

3. Determine $C$ and/or $D$ in (10.75) by (10.76) after a first high-order AR step to
find $\hat e$ (which is unnecessary if $C = 1$). The order of the AR model can be chosen
as the sum of all model orders in (10.62) so as to balance the computational
effort.   (10.79c)

In case $\mathcal S \in \mathcal M$ this will bring the initial parameter estimate arbitrarily close
to the true values as $N$ increases. From there, the methods of Section 10.2 will
efficiently bring us to the global minimum of the criterion. In this case we thus have
a procedure that is globally convergent to the global minimum for large enough $N$.

For a nonlinear black-box structure (10.61) several techniques exist. A simple
one is to "seed" a large number of fixed values of the nonlinear parameters $\beta_k$ and
$\gamma_k$, and estimate the corresponding $\alpha$'s by linear least squares. The estimates $\alpha_k$
that are most significant (relative to their estimated standard deviations) are then
selected and the corresponding $\gamma_k$ and $\beta_k$ are used as initial values for the ensuing
Gauss-Newton iterative search.


Initial Filter Conditions 


The filters (10.53)-(10.54) as well as (10.56) require initial values $\varphi(0,\theta)$. In case
the filters have finite impulse response, which happens only for
the ARX special case, we can wait to initialize the filters until enough past data are
known. This is what we called approach 1 following (10.13) in Section 10.1. In the
general case we need a strategy to deal with the unknown initial conditions. This has
not been extensively discussed in the literature, but we can point to the following
approaches:

1. Take $\varphi(0,\theta) = 0$.
2. Select $\varphi(0,\theta)$ so that the first $\hat y(t|\theta)$, $t = 1, \ldots, \dim\varphi$, match $y(t)$ exactly.
3. Introduce $\varphi(0,\theta) = \eta$ as a parameter, and estimate it along with $\theta$.
4. Estimate or "backforecast" $\varphi(0,\theta)$ from the data by running suitable filters
backwards in time (Knudsen, 1994).

For a model where the predictor filter transient is short compared to the data record,
it does not matter so much which approach is taken. However, with slowly decaying
transients, the first two methods may have a very negative influence on the model
quality. This is particularly pronounced for OE models, where no noise model will
pick up the residuals from bad transient behaviour.


10.6 SUBSPACE METHODS FOR ESTIMATING STATE SPACE MODELS 


Let us now consider how to estimate the system matrices $A$, $B$, $C$, and $D$ in a state
space model

$$ \begin{aligned} x(t+1) &= Ax(t) + Bu(t) + w(t) \\ y(t) &= Cx(t) + Du(t) + v(t) \end{aligned} \tag{10.80} $$

We assume that the output $y(t)$ is a $p$-dimensional column vector, while the input
$u(t)$ is an $m$-dimensional column vector. The order of the system, i.e., the dimension
of $x(t)$, is $n$. We also assume that this state-space representation is a minimal
realization. It is well known that the same input-output relationship can also be described
by

$$ \begin{aligned} \tilde x(t+1) &= T^{-1}AT\,\tilde x(t) + T^{-1}Bu(t) + \tilde w(t) \\ y(t) &= CT\,\tilde x(t) + Du(t) + v(t) \end{aligned} \tag{10.81} $$

for any invertible matrix $T$. This corresponds to the change of basis $x(t) = T\tilde x(t)$
in the state space.

In Section 7.3, algorithm (7.66), we presented an archetypical algorithm for this.
We shall here describe a family of related algorithms which all address this problem.
The discussion will be quite technical and a reader who primarily is interested in the
result could go directly to (10.125). In summary the algorithms are based on the
following observations:


• If $A$ and $C$ are known, it is an easy linear least squares problem to estimate $B$
and $D$ from

$$ y(t) = C(qI - A)^{-1}Bu(t) + Du(t) + \tilde v(t) \tag{10.82} $$

using the predictor

$$ \hat y(t|B,D) = C(qI - A)^{-1}Bu(t) + Du(t) \tag{10.83} $$

(The initial state $x(0)$ can also be estimated; see (10.86) below.)

• If the (extended) observability matrix for the system

$$ O_r = \begin{bmatrix} C \\ CA \\ \vdots \\ CA^{r-1} \end{bmatrix} \tag{10.84} $$

is known, then it is easy to determine $C$ and $A$. Use the first block row of $O_r$
and the shift property, respectively. This is really the key step.

• The extended observability matrix can be consistently estimated from input-output
data by direct least-squares like (projection) steps.

• Once the observability matrix has been estimated, the states $x(t)$ can be
constructed and the statistical properties of the noise contributions $w(t)$ and $v(t)$
can be established.


We shall now deal with each of these steps in somewhat more detail.


Estimating B and D

For given and fixed $A$ and $C$ the model structure (10.83):

$$ \hat y(t|B,D) = C(qI - A)^{-1}Bu(t) + Du(t) \tag{10.85a} $$

is clearly linear in $B$ and $D$. The predictor is also formed entirely from past inputs, so
it is an output error model structure. If the system operates in open loop, we can thus
consistently estimate $B$ and $D$ according to Theorem 8.4, even if the noise sequence

$$ \tilde v(t) = C(qI - A)^{-1}w(t) + v(t) $$

in (10.82) is non-white.
Let us write the predictor (10.83) in the standard linear regression form

$$ \hat y(t|\theta) = \varphi(t)\theta = \varphi(t)\begin{bmatrix} \mathrm{Vec}(B) \\ \mathrm{Vec}(D) \end{bmatrix} \tag{10.85b} $$

with a $p \times (mn + mp)$ matrix $\varphi(t)$. Here "Vec" is the operation that builds a vector
from a matrix, by stacking its columns on top of each other. Let $r = (k-1)n + j$.
To find the $r$:th ($r \le mn$) column of $\varphi(t)$, which corresponds to the $r$:th element of
$\theta$, i.e., the element $B_{j,k}$, we differentiate (10.85b) w.r.t. this element and obtain

$$ \varphi_r(t) = C(qI - A)^{-1}E_j u_k(t) $$

where $E_j$ is the column vector with the $j$:th element equal to 1 and the others equal
to 0. The columns for $r > nm$ are handled in a similar way.

If desired, also the initial state $x_0 = x(0)$ can be estimated in an analogous
way, since the predictor with initial values taken into account is

$$ \hat y(t|B,D,x_0) = C(qI - A)^{-1}x_0\delta(t) + C(qI - A)^{-1}Bu(t) + Du(t) \tag{10.86} $$

which is linear also in $x_0$. Here $\delta(t)$ is the unit pulse at time 0. Moreover, the
estimates can be improved by estimating the color of $\tilde v$ in (10.82) and prefiltering the
data accordingly.

Remark: If $A$ and $C$ are the correct values, the least squares estimates of $B$
and $D$ will also converge to their true values, according to Theorem 8.4. If consistent
estimates $\hat A_N$ and $\hat C_N$ are used instead, convergence of $\hat B_N$ and $\hat D_N$ to their true
values still holds. This follows by fairly straightforward calculations. See Vandersteen,
Van hamme, and Pintelon (1996) for a general treatment of such issues.
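
The regression (10.85b) can be sketched as follows. Each column of the regressor is obtained by simulating the known $(A, C)$ pair with a unit vector in place of $B$, one input channel at a time. The function names `simulate` and `estimate_B_D`, the zero initial state, and the noise-free test data are illustrative assumptions.

```python
import numpy as np

def simulate(A, B, C, D, u):
    """y(t) = C x(t) + D u(t), x(t+1) = A x(t) + B u(t), x(0) = 0; u is (N, m)."""
    x = np.zeros(A.shape[0])
    out = []
    for ut in u:
        out.append(C @ x + D @ ut)
        x = A @ x + B @ ut
    return np.array(out)                      # (N, p)

def estimate_B_D(A, C, y, u):
    """LS estimate of B and D for given A and C, following (10.85b)."""
    N, m = u.shape
    p, n = C.shape
    cols = []
    for k in range(m):                        # columns for Vec(B): element B[j, k]
        for j in range(n):
            Ej = np.zeros((n, 1))
            Ej[j] = 1.0
            resp = simulate(A, Ej, C, np.zeros((p, 1)), u[:, [k]])
            cols.append(resp.reshape(-1))     # C (qI - A)^{-1} E_j u_k(t), stacked over t
    for k in range(m):                        # columns for Vec(D): element D[i, k]
        for i in range(p):
            col = np.zeros((N, p))
            col[:, i] = u[:, k]
            cols.append(col.reshape(-1))
    Phi = np.column_stack(cols)               # (N p) x (mn + mp)
    theta = np.linalg.lstsq(Phi, y.reshape(-1), rcond=None)[0]
    B = theta[:m * n].reshape(m, n).T         # undo the column stacking of Vec
    D = theta[m * n:].reshape(m, p).T
    return B, D

rng = np.random.default_rng(0)
A = np.array([[0.8, 0.1], [0.0, 0.5]])
B = np.array([[1.0, 0.0], [0.5, -1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.2, -0.3]])
u = rng.standard_normal((200, 2))
y = simulate(A, B, C, D, u)
B_hat, D_hat = estimate_B_D(A, C, y, u)
```

With noisy data the same code applies unchanged; only the accuracy of the estimate, not the structure of the computation, is affected.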


Finding A and C from the Extended Observability Matrix 


Suppose that a $pr \times n^*$ dimensional matrix $G$ is given that is related to the extended
observability matrix of the system, (10.84). We have to determine $A$ and $C$ from $G$,
and we shall here consider cases of increasing complexity.


Known System Order. Suppose first we know that

$$ G = O_r $$

so that $n^* = n$. To find $C$ is then immediate:

$$ C = O_r(1:p,\; 1:n) \tag{10.87} $$

Here we used MATLAB notation, meaning that $M(s:\ell,\; j:k)$ is the matrix obtained
by extracting the rows $s, s+1, \ldots, \ell$ and the columns $j, j+1, \ldots, k$ from the
matrix $M$.

Similarly we can find $A$ from the equation

$$ O_r(p+1:pr,\; 1:n) = O_r(1:p(r-1),\; 1:n)\,A \tag{10.88} $$

which is easily seen from the definition (10.84). Under the assumption of observability,
$O_{r-1}$ has rank $n$, so $A$ can be determined uniquely. Normally (10.88) is an
overdetermined set of equations ($n^2$ unknowns in $A$ and $npr - np$ equations; recall
that $r > n + 1$). This is of no consequence if $O_r$ is exactly of the form (10.84), since
any full rank subset of equations will give the same $A$.


Role of State-Space Basis. The extended observability matrix depends on the choice
of basis in the state-space representation. For the representation (10.81) it is easy to
verify that the observability matrix would be

$$ \tilde O_r = O_r T \tag{10.89} $$

Applying (10.87) and (10.88) to $\tilde O_r$ would thus give the system matrices associated
with (10.81). Consequently, multiplying the extended observability matrix from the
right by any invertible matrix before applying (10.87) and (10.88) will not change the
system estimate, just the basis of representation.
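
The extraction steps (10.87) and (10.88), and the effect of a change of basis (10.89), can be sketched as below. The function names and the small example system are assumptions for illustration; the shift equation is solved with a least-squares routine, which returns the exact $A$ when the matrix has the exact structure (10.84).

```python
import numpy as np

def extended_observability(A, C, r):
    """Stack C, CA, ..., CA^(r-1) as in (10.84)."""
    blocks = [C]
    for _ in range(r - 1):
        blocks.append(blocks[-1] @ A)
    return np.vstack(blocks)

def A_C_from_observability(O, p, n):
    """Apply (10.87) and (10.88) to a pr x n matrix O."""
    C = O[:p, :n]                                               # first block row
    A = np.linalg.lstsq(O[:-p, :n], O[p:, :n], rcond=None)[0]   # shift property
    return A, C

A = np.array([[0.8, 0.1], [0.0, 0.5]])
C = np.array([[1.0, 0.0]])
O = extended_observability(A, C, r=4)
A_hat, C_hat = A_C_from_observability(O, p=1, n=2)

T = np.array([[1.0, 2.0], [0.0, 1.0]])                  # an arbitrary invertible matrix
A_T, C_T = A_C_from_observability(O @ T, p=1, n=2)      # same system, other basis
```

The pair obtained from $O_r T$ equals $(T^{-1}AT,\; CT)$, so the eigenvalues of the estimated $A$ matrix are unaffected by the choice of $T$.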


Unknown System Order. Suppose now that the true order of the system is unknown,
and that $n^*$, the number of columns of $G$, is just an upper bound for the order.
This means that we have

$$ G = O_r\tilde T \tag{10.90} $$

for some unknown, but full rank, $n \times n^*$ matrix $\tilde T$, where also $n$ is unknown to us.
The rank of $G$ is $n$. A straightforward way to deal with this would be to determine
this rank, delete the last $n^* - n$ columns of $G$ and then proceed as above. A more
general and numerically sound way of reducing the column space is to use singular
value decomposition (SVD):

$$ G = USV^T = U\begin{bmatrix} \sigma_1 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{n^*} \\ 0 & 0 & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}V^T \tag{10.91} $$

Here $U$ and $V$ are orthonormal matrices ($U^TU = I$, $V^TV = I$) of dimensions
$pr \times pr$ and $n^* \times n^*$, respectively. $S$ is a $pr \times n^*$ matrix with the singular values of
$G$ along the diagonal and zeros elsewhere. If $G$ has rank $n$, only the first $n$ singular
values $\sigma_k$ will be non-zero. This means that we can rewrite

$$ G = USV^T = U_1S_1V_1^T \tag{10.92} $$

where $U_1$ is a $pr \times n$ matrix containing the first $n$ columns of $U$, while $S_1$ is the
$n \times n$ upper left part of $S$, and $V_1$ consists of the first $n$ columns of $V$. (We still have
$V_1^TV_1 = I$.) From (10.90) we find $O_r\tilde T = U_1S_1V_1^T$. Multiplying this by $V_1$ from the
right gives

$$ O_r\tilde TV_1 = O_rT = U_1S_1 \tag{10.93} $$

for some invertible matrix $T = \tilde TV_1$. We are now in the situation (10.89) that we
know the observability matrix up to an invertible matrix $T$, or equivalently, we
know the observability matrix in some state-space basis. Consequently we can use
$\hat O_r = U_1S_1$ or $\hat O_r = U_1$ or any matrix that can be written as

$$ \hat O_r = U_1R \tag{10.94} $$

for some invertible $R$ in (10.87) and (10.88) to determine the $p \times n$ matrix $C$ and
the $n \times n$ matrix $A$.


Using a Noisy Estimate of the Extended Observability Matrix. Let us now assume
that the given $pr \times n^*$ matrix $G$ is a noisy estimate of the true observability matrix


$$G = O_r \tilde{T} + E_N \tag{10.95}$$


where $E_N$ is small and tends to zero as $N \to \infty$. The rank of $O_r$ is not known,
while the "noise matrix" $E_N$ is likely to be of full rank. It is reasonable to proceed
as above and perform an SVD on $G$:
$$G = USV^T \tag{10.96}$$
Due to the noise, $S$ will typically have all singular values $\sigma_k$; $k = 1, \ldots, \min(n^*, pr)$
non-zero. The first $n$ will be supported by $O_r$, while the remaining ones will stem
from $E_N$. If the noise is small, one should expect that the latter are significantly
smaller than the former. Therefore determine $\hat{n}$ as the number of singular values
that are significantly larger than 0. Then keep those and replace the others in $S$ by
zeros, and proceed as in (10.92) to determine $U_1$ and $S_1$. Then use $\hat{O}_r$ in (10.94) to
determine $\hat{A}$ and $\hat{C}$ as before. However, in this noisy case, $\hat{O}_r$ will not be exactly
subject to the shift structure (10.88), so this system of equations should be solved in
a least-squares sense.
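
To illustrate how the singular values separate into a part supported by the system and a part caused by noise, here is a small sketch. It is only an illustration under stated assumptions: the first-order system, the perturbation `E_N`, the relative threshold `rel_tol`, and the function names are all hypothetical, and the closed-form singular values are valid only for a matrix with two columns ($n^* = 2$).

```python
import math

def singular_values_two_columns(G):
    """Singular values of a matrix with two columns, from the eigenvalues of G^T G."""
    a = sum(row[0] * row[0] for row in G)
    b = sum(row[0] * row[1] for row in G)
    d = sum(row[1] * row[1] for row in G)
    mean = (a + d) / 2
    gap = math.sqrt(((a - d) / 2) ** 2 + b * b)
    return math.sqrt(mean + gap), math.sqrt(max(mean - gap, 0.0))

def estimate_order(sigmas, rel_tol):
    """Count the singular values that exceed rel_tol times the largest one."""
    return sum(1 for s in sigmas if s > rel_tol * sigmas[0])

# Noise-free G = O_r * T~ for a first-order system (n = 1, C = 1, A = 0.8, r = 4).
O_r = [0.8 ** k for k in range(4)]
T_tilde = [1.0, -2.0]
G_clean = [[o * t for t in T_tilde] for o in O_r]

# Hypothetical small perturbation playing the role of E_N in (10.95).
E_N = [[1e-3, -2e-3], [2e-3, 1e-3], [-1e-3, 1e-3], [1e-3, 2e-3]]
G_noisy = [[g + e for g, e in zip(g_row, e_row)] for g_row, e_row in zip(G_clean, E_N)]

s_clean = singular_values_two_columns(G_clean)   # second value is numerically zero
s_noisy = singular_values_two_columns(G_noisy)   # second value is small but non-zero
n_hat = estimate_order(s_noisy, rel_tol=0.05)
```

The choice of threshold is a design decision: the text only asks for singular values "significantly larger than 0", and the relative tolerance used here is one possible way to make that concrete.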

The consistency of this process as $N \to \infty$ and $E_N \to 0$ is rather easy to
establish by a continuity argument: As $E_N$ tends to zero, the corresponding estimates
of $A$ and $C$ will tend to the values that are obtained from (10.90) with $E_N = 0$.

It is more difficult to analyze how the variance of $E_N$ will influence the variance
of $\hat{A}$ and $\hat{C}$. Some results about this are given in Viberg et al. (1993), based on work
by T.W. Anderson.

Using Weighting Matrices in the SVD. For more flexibility we could pre- and post-multiply
$G$ as $\hat{G} = W_1 G W_2$ before performing the SVD

$$\hat{G} = W_1 G W_2 = USV^T \approx U_1 S_1 V_1^T \tag{10.97}$$
and then instead of (10.94) use


$$\hat{O}_r = W_1^{-1} U_1 R \tag{10.98}$$


to determine $\hat{C}$ and $\hat{A}$ in (10.87) and (10.88). Here $R$ is an arbitrary matrix
that will determine the coordinate basis for the state representation. The post-multiplication
by $W_2$ just corresponds to a change of basis in the state-space and
the pre-multiplication by $W_1$ is eliminated in (10.98), so in the noiseless case $E_N = 0$,
these weightings are without consequence. However, when noise is present, they
have an important influence on the space spanned by $U_1$, and hence on the quality of
the estimates $\hat{C}$ and $\hat{A}$. We may remark that post-multiplying $W_2$ by an orthonormal
matrix does not affect the $U_1$-matrix in the decomposition. See Problem 10E.10. We
shall return to these questions below.

Estimating the Extended Observability Matrix 


The Basic Expression. From (10.80) we find that
$$\begin{aligned}
y(t+k) &= Cx(t+k) + Du(t+k) + v(t+k) \\
&= CAx(t+k-1) + CBu(t+k-1) + Cw(t+k-1) \\
&\quad + Du(t+k) + v(t+k) \\
&= CA^k x(t) + CA^{k-1}Bu(t) + CA^{k-2}Bu(t+1) + \cdots \\
&\quad + CBu(t+k-1) + Du(t+k) \\
&\quad + CA^{k-1}w(t) + CA^{k-2}w(t+1) + \cdots \\
&\quad + Cw(t+k-1) + v(t+k)
\end{aligned} \tag{10.99}$$


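Now, form the vectors


$$Y_r(t) = \begin{bmatrix} y(t) \\ y(t+1) \\ \vdots \\ y(t+r-1) \end{bmatrix}, \qquad U_r(t) = \begin{bmatrix} u(t) \\ u(t+1) \\ \vdots \\ u(t+r-1) \end{bmatrix} \tag{10.100}$$


and collect (10.99) as


$$Y_r(t) = O_r x(t) + S_r U_r(t) + V(t) \tag{10.101}$$
with
$$S_r = \begin{bmatrix} D & 0 & \cdots & 0 & 0 \\ CB & D & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ CA^{r-2}B & CA^{r-3}B & \cdots & CB & D \end{bmatrix}$$


and the $k$th block component of $V(t)$
$$V_k(t) = CA^{k-2}w(t) + CA^{k-3}w(t+1) + \cdots + Cw(t+k-2) + v(t+k-1) \tag{10.102}$$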


We shall use (10.101) to estimate $O_r$, or rather a matrix $O_r\tilde{T}$ for some (unknown)
$\tilde{T}$. The idea is to correlate both sides of (10.101) with quantities that eliminate the
term with $U_r(t)$ and make the noise influence from $V$ disappear asymptotically. For
this, we have the measurements $y(t)$, $u(t)$, $t = 1, \ldots, N + r - 1$ available. It will
be easier to describe the correlation operations as matrix multiplications, and we
therefore introduce


$$\begin{aligned}
\mathbf{Y} &= [\,Y_r(1) \;\; Y_r(2) \;\; \ldots \;\; Y_r(N)\,] \\
\mathbf{X} &= [\,x(1) \;\; x(2) \;\; \ldots \;\; x(N)\,] \\
\mathbf{U} &= [\,U_r(1) \;\; U_r(2) \;\; \ldots \;\; U_r(N)\,] \\
\mathbf{V} &= [\,V(1) \;\; V(2) \;\; \ldots \;\; V(N)\,]
\end{aligned} \tag{10.103}$$


These quantities depend on $r$ and $N$, but this dependence is suppressed. We can
now rewrite (10.101) as the basic expression


$$\mathbf{Y} = O_r\mathbf{X} + S_r\mathbf{U} + \mathbf{V} \tag{10.104}$$
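
The basic expression can be verified column by column on simulated data. The sketch below does this for a noise-free first-order system, so the noise matrix is zero; the system coefficients, the input sequence, and the helper names are hypothetical choices made for the illustration, and Python index 0 plays the role of time $t = 1$.

```python
from fractions import Fraction as F

# Hypothetical first-order system (n = 1, one input, one output), no noise.
a, b, c, d = F(1, 2), F(1), F(2), F(1, 3)
r, N = 3, 5

u = [F(k % 3 - 1) for k in range(N + r)]        # arbitrary input sequence
x = [F(1)]                                       # initial state
for t in range(N + r - 1):
    x.append(a * x[t] + b * u[t])                # x(t+1) = A x(t) + B u(t)
y = [c * xt + d * ut for xt, ut in zip(x, u)]    # y(t) = C x(t) + D u(t)

def stack(signal, t, r):
    """Stacked vector [s(t), s(t+1), ..., s(t+r-1)] as in (10.100); t is a 0-based index."""
    return signal[t:t + r]

# Data matrices of (10.103), stored column by column.
Y_cols = [stack(y, t, r) for t in range(N)]
U_cols = [stack(u, t, r) for t in range(N)]

# Extended observability matrix and the lower triangular Toeplitz matrix S_r.
O_r = [c * a ** k for k in range(r)]

def markov(k):
    """Impulse response coefficient: D for k = 0 and C A^(k-1) B for k >= 1."""
    return d if k == 0 else c * a ** (k - 1) * b

S_r = [[markov(i - j) if i >= j else F(0) for j in range(r)] for i in range(r)]

def predicted_column(t):
    """Right-hand side of (10.101) with V(t) = 0."""
    return [O_r[i] * x[t] + sum(S_r[i][j] * U_cols[t][j] for j in range(r))
            for i in range(r)]

columns_match = [Y_cols[t] == predicted_column(t) for t in range(N)]
```

With noise present the same construction holds, except that each column then differs from the prediction by the corresponding noise vector $V(t)$.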


Remark. Define as in (7.59) the vector $\hat{\mathbf{Y}}$ of true $k$-step ahead predictors.
Then it follows from (10.104) that


$$\hat{\mathbf{Y}} = O_r\hat{\mathbf{X}} \tag{10.105}$$


where $\hat{\mathbf{X}}$ is made up from the predicted (Kalman-filter) states $\hat{x}(t|t-1)$, i.e., the best
estimate of $x(t)$ based on past input-output data.


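Removing the $\mathbf{U}$-term. Form the $N \times N$ matrix
$$\Pi^{\perp}_{\mathbf{U}^T} = I - \mathbf{U}^T(\mathbf{U}\mathbf{U}^T)^{-1}\mathbf{U} \tag{10.106}$$


(if $\mathbf{U}\mathbf{U}^T$ is singular, use pseudoinverse instead). This matrix performs projection,
orthogonal to the matrix $\mathbf{U}$, i.e.,


$$\mathbf{U}\Pi^{\perp}_{\mathbf{U}^T} = \mathbf{U} - \mathbf{U}\mathbf{U}^T(\mathbf{U}\mathbf{U}^T)^{-1}\mathbf{U} = 0$$
Multiplying (10.104) from the right by $\Pi^{\perp}_{\mathbf{U}^T}$ will thus eliminate the term with $\mathbf{U}$:


$$\mathbf{Y}\Pi^{\perp}_{\mathbf{U}^T} = O_r\mathbf{X}\Pi^{\perp}_{\mathbf{U}^T} + \mathbf{V}\Pi^{\perp}_{\mathbf{U}^T} \tag{10.107}$$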
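
The projection step can be tried on a toy example. In the sketch below the input matrix has two rows and five columns, and the "state" row, the coefficients 3, 2 and 5, and the helper names are invented for illustration. Exact rational arithmetic makes the annihilation of the input term visible without rounding. The $N \times N$ projection is formed explicitly only because $N$ is tiny here.

```python
from fractions import Fraction as F

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def inv2(M):
    """Inverse of a 2 x 2 matrix (the input block has two rows in this example)."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

N = 5
U = [[F(1), F(-1), F(2), F(0), F(1)],
     [F(-1), F(2), F(0), F(1), F(3)]]            # hypothetical 2 x N input matrix

# Projection (10.106): I - U^T (U U^T)^{-1} U.
Ut = transpose(U)
P_U = matmul(matmul(Ut, inv2(matmul(U, Ut))), U)
Pi = [[F(int(i == j)) - P_U[i][j] for j in range(N)] for i in range(N)]

U_Pi = matmul(U, Pi)                             # every entry is zero

# A noise-free relation of the form Y = O X + S U with one output row.
X = [[F(1), F(2), F(0), F(-1), F(1)]]
Y = [[3 * X[0][t] + 2 * U[0][t] + 5 * U[1][t] for t in range(N)]]

Y_Pi = matmul(Y, Pi)                             # equals 3 * (X Pi): the U-term is gone
X_Pi = matmul(X, Pi)
```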
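Removing the Noise Term. The next problem is to eliminate the last term. Since
this term is made up of noise contributions, the idea is to correlate it away with a
suitable matrix. Define the $s \times N$ matrix ($s > n$)


$$\Phi = [\,\varphi_s(1) \;\; \varphi_s(2) \;\; \ldots \;\; \varphi_s(N)\,] \tag{10.108}$$


where $\varphi_s(t)$ is a yet undefined vector. Multiply (10.107) from the right by $\Phi^T$ and
normalize by $N$:


$$G = \frac{1}{N}\mathbf{Y}\Pi^{\perp}_{\mathbf{U}^T}\Phi^T = O_r\frac{1}{N}\mathbf{X}\Pi^{\perp}_{\mathbf{U}^T}\Phi^T + \frac{1}{N}\mathbf{V}\Pi^{\perp}_{\mathbf{U}^T}\Phi^T \triangleq O_r\tilde{T}_N + V_N \tag{10.109}$$


Here $\tilde{T}_N$ is an $n \times s$ matrix. Suppose now that we can find $\varphi_s(t)$ so that


$$\lim_{N\to\infty} V_N = \lim_{N\to\infty}\frac{1}{N}\mathbf{V}\Pi^{\perp}_{\mathbf{U}^T}\Phi^T = 0 \tag{10.110a}$$
$$\lim_{N\to\infty} \tilde{T}_N = \lim_{N\to\infty}\frac{1}{N}\mathbf{X}\Pi^{\perp}_{\mathbf{U}^T}\Phi^T = \tilde{T} \ \text{ has full rank } n \tag{10.110b}$$
Then (10.109) would read
$$\begin{aligned}
G &= \frac{1}{N}\mathbf{Y}\Pi^{\perp}_{\mathbf{U}^T}\Phi^T = O_r\tilde{T} + E_N \\
E_N &= O_r(\tilde{T}_N - \tilde{T}) + V_N \to 0 \ \text{ as } N \to \infty
\end{aligned} \tag{10.111}$$


The $pr \times s$ matrix $G$ can thus be seen as a noisy estimate (10.95) and we can subject
it to the treatment (10.96)-(10.98) to obtain estimates of $A$ and $C$.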


Finding Good Instruments. The only remaining question is how to achieve (10.110).
Notice that these requirements are just like (7.119) and (7.120) for the instrumental
variable method. Using the expression (10.106) for $\Pi^{\perp}_{\mathbf{U}^T}$ and writing out the matrix
multiplications as sums gives


$$\frac{1}{N}\mathbf{V}\Pi^{\perp}_{\mathbf{U}^T}\Phi^T = \frac{1}{N}\sum_{t=1}^{N} V(t)\varphi_s^T(t) - \frac{1}{N}\sum_{t=1}^{N} V(t)U_r^T(t) \left[\frac{1}{N}\sum_{t=1}^{N} U_r(t)U_r^T(t)\right]^{-1} \frac{1}{N}\sum_{t=1}^{N} U_r(t)\varphi_s^T(t) \tag{10.112}$$


Under mild conditions, the law of large numbers states that the sample sums converge
to their respective expected values


$$\lim_{N\to\infty}\frac{1}{N}\mathbf{V}\Pi^{\perp}_{\mathbf{U}^T}\Phi^T = EV(t)\varphi_s^T(t) - EV(t)U_r^T(t)\,R_u^{-1}\,EU_r(t)\varphi_s^T(t) \tag{10.113}$$


where
$$R_u = EU_r(t)U_r^T(t)$$
is the $r \times r$ covariance matrix of the input. Now, assume that the input $u$ is generated
in open loop, so that it is independent of the noise terms in $V$ (see (10.102)). Then
$EV(t)U_r^T(t) = 0$. Assume also that $R_u$ is invertible. (This means that the input is
persistently exciting of order $r$; see Section 13.2.) Then the second term of (10.113)
will be zero. (If the pseudo-inverse is used in (10.106), this is still true, even if $R_u$ is
not invertible.) For the first term to be zero, we must require $V(t)$ and $\varphi_s(t)$ to be
uncorrelated. Since $V(t)$ according to (10.102) is made up of white noise terms from
time $t$ and onwards, any choice $\varphi_s(t)$ built up from data prior to time $t$ will satisfy
(10.110a). A typical choice would be


$$\varphi_s(t) = \begin{bmatrix} y(t-1) \\ \vdots \\ y(t-s_1) \\ u(t-1) \\ \vdots \\ u(t-s_2) \end{bmatrix} \tag{10.114}$$
Now, turning to (10.110b) we find by a similar argument that
$$\tilde{T} = Ex(t)\varphi_s^T(t) - Ex(t)U_r^T(t)\,R_u^{-1}\,EU_r(t)\varphi_s^T(t) \tag{10.115}$$


A formal proof that $\tilde{T}$ has full rank is not immediate and will involve properties of
the input. See Problem 10G.6 and Van Overschee and DeMoor (1996).

Summing up, forming $G = \frac{1}{N}\mathbf{Y}\Pi^{\perp}_{\mathbf{U}^T}\Phi^T$ with $\Phi$ defined by (10.114) and (10.108)
gives the properties (10.111), which allows us to consistently determine $A$ and $C$, via
(10.97), (10.98), and (10.88), (10.87). We shall sum up all the steps in the algorithm
later.


Finding the States and Estimating the Noise Statistics 


In (10.104) we constructed a direct relationship between future outputs and the states. 
Let us now shift the perspective somewhat and return to the prediction approach of 
Section 7.3. This will give an expression which is closely related to (10.104) and shows 
the links between states and predictors. 

Let us estimate the k-step ahead predictors. Recall (7.63). 


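$$Y_r(t) = \Theta\varphi_s(t) + \Gamma U_r(t) + E(t) \tag{10.116}$$
Collecting all $t$ as in (10.104) gives
$$\mathbf{Y} = \Theta\Phi + \Gamma\mathbf{U} + \mathbf{E} \tag{10.117}$$
The Least Squares estimate of the parameters is


$$[\,\hat{\Theta} \;\; \hat{\Gamma}\,] = [\,\mathbf{Y}\Phi^T \;\; \mathbf{Y}\mathbf{U}^T\,] \begin{bmatrix} \Phi\Phi^T & \Phi\mathbf{U}^T \\ \mathbf{U}\Phi^T & \mathbf{U}\mathbf{U}^T \end{bmatrix}^{-1} \tag{10.118}$$
Using the expression for inverting block matrices gives (see Problem 10E.11)
$$\hat{\Theta} = \mathbf{Y}\Pi^{\perp}_{\mathbf{U}^T}\Phi^T\left(\Phi\Pi^{\perp}_{\mathbf{U}^T}\Phi^T\right)^{-1} \tag{10.119}$$
This means that the matrix of predicted outputs can be written


$$\hat{\mathbf{Y}} = \hat{\Theta}\Phi = \mathbf{Y}\Pi^{\perp}_{\mathbf{U}^T}\Phi^T\left(\Phi\Pi^{\perp}_{\mathbf{U}^T}\Phi^T\right)^{-1}\Phi \tag{10.120}$$


which we recognize as $\hat{G}$ in (10.97), with $G$ as in (10.109) and the weightings $W_1 = I$
and $W_2 = \left(\frac{1}{N}\Phi\Pi^{\perp}_{\mathbf{U}^T}\Phi^T\right)^{-1}\Phi$. Performing the SVD and deleting small singular values
thus gives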


$$\hat{\mathbf{Y}} \approx U_1 S_1 V_1^T \tag{10.121}$$


Here $V_1^T$ is an $n \times N$ matrix. We know from (10.98) that $U_1$ is related to the
observability matrix by some invertible matrix $R$ as $U_1 = \hat{O}_r R^{-1}$. Introduce
$\hat{\mathbf{X}} = R^{-1} S_1 V_1^T$. Then (10.121) can be written as


$$\hat{\mathbf{Y}} = \hat{O}_r R^{-1} S_1 V_1^T = \hat{O}_r\hat{\mathbf{X}} \tag{10.122}$$


Comparing with (10.105) shows that $\hat{\mathbf{X}}$ must be the matrix of the correct state estimates
if $\hat{\mathbf{Y}}$ is the matrix of true predicted outputs. The true predicted outputs are,
however, normally obtained only when the orders $s_i$ in (10.114) tend to infinity. In
such a case we have then found the true state estimates (in the state-space basis in
question) from the SVD. Alternatively we can write


$$\begin{aligned}
\hat{\mathbf{X}} &= L\hat{\mathbf{Y}} = [\,\hat{x}(1) \;\; \ldots \;\; \hat{x}(N)\,] \\
L &= R^{-1}U_1^T
\end{aligned} \tag{10.123}$$


since $U_1^T U_1 = I$ from the SVD-properties. With the states given, we can estimate
the process and measurement noises as


$$\begin{aligned}
\hat{w}(t) &= \hat{x}(t+1) - \hat{A}\hat{x}(t) - \hat{B}u(t) \\
\hat{v}(t) &= y(t) - \hat{C}\hat{x}(t) - \hat{D}u(t)
\end{aligned} \tag{10.124}$$


and estimate their covariance matrices in a straightforward fashion. Here $\hat{A}$, $\hat{B}$, $\hat{C}$, $\hat{D}$
are the estimates of the system matrices obtained as described above. Alternatively,
these could be directly estimated by the least squares procedure (7.66), once the
states are known.
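
As a small illustration of the last step, the following sketch computes the residuals of (10.124) and their sample covariances for a scalar state, input, and output. The data are simulated with known noise sequences so that the recovery can be checked exactly; all numerical values and function names are hypothetical, and the true states and system coefficients stand in for the estimates that the algorithm would supply.

```python
from fractions import Fraction as F

def noise_residuals(x_hat, y, u, A, B, C, D):
    """Residuals (10.124) for a scalar state, input, and output."""
    steps = range(len(x_hat) - 1)
    w_hat = [x_hat[t + 1] - A * x_hat[t] - B * u[t] for t in steps]
    v_hat = [y[t] - C * x_hat[t] - D * u[t] for t in steps]
    return w_hat, v_hat

def sample_covariance(e1, e2):
    """Sample (cross) covariance of two zero-mean sequences of equal length."""
    return sum(p * q for p, q in zip(e1, e2)) / len(e1)

# Hypothetical scalar system and noise sequences.
A, B, C, D = F(1, 2), F(1), F(1), F(0)
u = [F(1), F(0), F(-1), F(2), F(1)]
w = [F(1, 4), F(-1, 4), F(0), F(1, 2), F(-1, 2)]
v = [F(1, 10), F(0), F(-1, 10), F(1, 5), F(-1, 5)]

x = [F(0)]
for t in range(len(u)):
    x.append(A * x[t] + B * u[t] + w[t])          # state equation with process noise
y = [C * x[t] + D * u[t] + v[t] for t in range(len(u))]

w_hat, v_hat = noise_residuals(x, y, u, A, B, C, D)
Q_hat = sample_covariance(w_hat, w_hat)           # process noise variance estimate
R_hat = sample_covariance(v_hat, v_hat)           # measurement noise variance estimate
S_hat = sample_covariance(w_hat, v_hat)           # cross covariance estimate
```

In practice the state sequence and the system matrices are themselves estimates, so the residuals only approximate the true noise sequences.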


Putting It All Together 


We have now described all the basic steps of the algorithm, although in somewhat
reverse order. The complete subspace algorithm can be summarized as follows:

The Family of Subspace Algorithms (10.125)

1. From the input-output data, form
$$G = \frac{1}{N}\mathbf{Y}\Pi^{\perp}_{\mathbf{U}^T}\Phi^T \tag{10.126}$$
with the involved matrices defined by (10.103), (10.100), (10.106), (10.114), and
(10.108).

2. Select weighting matrices $W_1$ ($rp \times rp$ and invertible) and $W_2$ ($(ps_1 + ms_2) \times \alpha$)
and perform SVD
$$\hat{G} = W_1 G W_2 = USV^T \approx U_1 S_1 V_1^T \tag{10.127}$$
where the last approximation is obtained by keeping the $n$ most significant
singular values in $S$ and setting the remaining ones to zero. ($U_1$ is
now $rp \times n$, $S_1$ is $n \times n$ and $V_1^T$ is $n \times \alpha$.) As remarked above, post-multiplying
$W_2$ by any $\alpha \times k$ orthonormal matrix (with $k > rp$) will not change $U_1$. (See
Problem 10E.10.)

3. Select a full rank matrix $R$ and define the $rp \times n$ matrix $\hat{O}_r = W_1^{-1}U_1R$. Solve
$$\hat{C} = \hat{O}_r(1:p,\ 1:n) \tag{10.128a}$$
$$\hat{O}_r(p+1:pr,\ 1:n) = \hat{O}_r(1:p(r-1),\ 1:n)\,\hat{A} \tag{10.128b}$$
for $\hat{C}$ and $\hat{A}$. The latter equation should be solved in a least squares sense.

4. Estimate $B$, $D$ and $x_0$ from the linear regression problem:
$$\arg\min_{B,D,x_0} \frac{1}{N}\sum_{t=1}^{N}\left\| y(t) - \hat{C}(qI - \hat{A})^{-1}Bu(t) - Du(t) - \hat{C}(qI - \hat{A})^{-1}x_0\delta(t) \right\|^2 \tag{10.129}$$

5. If a noise model is sought, form $\hat{\mathbf{X}}$ as in (10.123) and estimate the noise contributions
as in (10.124).
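
Steps 1 to 3 can be sketched end to end for the simplest possible case: a scalar first-order system ($n = 1$, $p = m = 1$). This is an illustration under several assumptions, not a general implementation: the weightings are taken as identities, the instruments use $s_1 = s_2 = 1$, the rank-one SVD factor is found by power iteration instead of a full SVD, and the simulated system, seed, and function names are hypothetical. Only the pole is returned, since for a scalar state it does not depend on the state-space basis.

```python
import random

def solve(M, rhs):
    """Solve M z = rhs for a small square M by Gauss-Jordan elimination with pivoting."""
    n = len(M)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for i in range(n):
            if i != col:
                f = aug[i][col] / aug[col][col]
                aug[i] = [p - f * q for p, q in zip(aug[i], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def gram(P, Q):
    """Sample correlation (1/N) P Q^T for matrices stored as lists of rows."""
    N = len(P[0])
    return [[sum(a * b for a, b in zip(p, q)) / N for q in Q] for p in P]

def subspace_pole(y, u, r=3):
    """Estimate the pole of a first-order system from scalar input-output data."""
    N = len(y) - r
    ts = range(1, N + 1)
    Y = [[y[t + i] for t in ts] for i in range(r)]           # future outputs, (10.103)
    U = [[u[t + i] for t in ts] for i in range(r)]           # future inputs
    Phi = [[y[t - 1] for t in ts], [u[t - 1] for t in ts]]   # instruments (10.114)
    s = len(Phi)

    # Step 1: G = (1/N) Y Pi Phi^T, expanded so the N x N projection is never formed.
    R_yp, R_yu, R_uu, R_up = gram(Y, Phi), gram(Y, U), gram(U, U), gram(U, Phi)
    Z_cols = [solve(R_uu, [row[j] for row in R_up]) for j in range(s)]
    G = [[R_yp[i][j] - sum(R_yu[i][k] * Z_cols[j][k] for k in range(r))
          for j in range(s)] for i in range(r)]

    # Step 2 (identity weightings, n = 1): dominant left singular vector by power iteration.
    v = [1.0] * r
    for _ in range(50):
        g = [sum(G[i][j] * v[i] for i in range(r)) for j in range(s)]   # G^T v
        v = [sum(G[i][j] * g[j] for j in range(s)) for i in range(r)]   # G (G^T v)
        norm = sum(e * e for e in v) ** 0.5
        v = [e / norm for e in v]

    # Step 3 (R = 1): least-squares solution of the shift equation (10.128b).
    return sum(v[k] * v[k + 1] for k in range(r - 1)) / sum(v[k] ** 2 for k in range(r - 1))

def simulate(a, b, c, d, n_samples, noise_std=0.0, seed=0):
    """Simulate x(t+1) = a x + b u, y = c x + d u + noise with a white Gaussian input."""
    rng = random.Random(seed)
    u = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    x, y = 0.0, []
    for t in range(n_samples):
        y.append(c * x + d * u[t] + noise_std * rng.gauss(0.0, 1.0))
        x = a * x + b * u[t]
    return y, u

y, u = simulate(a=0.7, b=1.0, c=1.0, d=0.5, n_samples=203)
a_hat = subspace_pole(y, u)
```

With `noise_std=0.0` the matrix `G` has rank one and the pole is recovered up to rounding. With measurement noise the estimate is only approximate, and the choice of weightings and horizons discussed below starts to matter.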


Numerical Implementation. It should be mentioned that the most efficient numerical
implementation of the above steps is to apply QR-factorization of the data matrix
$[\,\mathbf{U}^T \;\; \Phi^T \;\; \mathbf{Y}^T\,]^T = LQ$. ($L$ is here a lower triangular, $pr + mr + s$ square matrix,
while $Q$ is an orthonormal $(pr + mr + s) \times N$ matrix.) Then the crucial SVD-factor
$U_1$ in (10.127) can be found entirely from the "L-part" of this factorization, i.e., the
"small" matrix. See Problem 10G.5 as well as Van Overschee and DeMoor (1996)
and Viberg, Wahlberg, and Ottersten (1997).

The family of methods contains a number of design variables. Different algo- 
rithms described in the literature correspond to different choices of these variables. 
It is at present not fully understood how to choose them optimally. The choices are: 


• The choice of correlation vector $\varphi_s(t)$ in (10.114). The requirement is that
(10.110) holds. Notice that this choice also determines the ARX-model that
is used to approximate the system when finding the $k$-step ahead predictors in
(7.62). Most algorithms choose $\varphi_s(t)$ to consist of past inputs and outputs as
in (10.114) with $s_1 = s_2$. The scalar $s$ is then the only design variable.

If $s_1 = 0$ only past inputs are used, and an output-error variant of the
algorithm is obtained. Then the noise is ignored when forming the predictors,
which leads to models like (7.55) that do not attempt to describe the noise
properties, but therefore also may describe the input-output properties with
fewer states. The OE-MOESP algorithm of Verhaegen (1994) uses this choice
of regressors.


• The scalar $r$, which is the maximal prediction horizon used. Many algorithms
use $r = s$, but there is no particular reason for such a choice.


• The weighting matrices $W_1$ and $W_2$. This is perhaps the most important choice.
Existing algorithms employ the following choices:


  • MOESP, Verhaegen (1994): $W_1 = I$, $W_2 = \left(\frac{1}{N}\Phi\Pi^{\perp}_{\mathbf{U}^T}\Phi^T\right)^{-1}\Phi\Pi^{\perp}_{\mathbf{U}^T}$

  • N4SID, Van Overschee and DeMoor (1994): $W_1 = I$,
    $W_2 = \left(\frac{1}{N}\Phi\Pi^{\perp}_{\mathbf{U}^T}\Phi^T\right)^{-1}\Phi$ (see also (10.120).)

  • IVM, Viberg (1995): $W_1 = \left(\frac{1}{N}\mathbf{Y}\Pi^{\perp}_{\mathbf{U}^T}\mathbf{Y}^T\right)^{-1/2}$, $W_2 = \left(\frac{1}{N}\Phi\Phi^T\right)^{-1/2}$

  • CVA, Larimore (1990): $W_1 = \left(\frac{1}{N}\mathbf{Y}\Pi^{\perp}_{\mathbf{U}^T}\mathbf{Y}^T\right)^{-1/2}$, $W_2 = \left(\frac{1}{N}\Phi\Pi^{\perp}_{\mathbf{U}^T}\Phi^T\right)^{-1/2}$


(Note that $W_2$ can be expressed in several ways, since post-multiplying with
any orthonormal matrix does not change the resulting estimates.) The effects
of the weightings are discussed in several papers. See the bibliography.


• The matrix $R$ in step 3. Typical choices are $R = I$, $R = S_1$ or $R = S_1^{1/2}$.


10.7 SUMMARY 


Determining a parameter estimate from data has two aspects. First, one has to decide
how to characterize the sought estimate: as the solution of a certain equation or as
the minimizing argument of some function. Second, one has to devise a numerical
method that calculates that estimate. It is important to keep these issues separate.
The combination of several different approaches to characterize the desired estimate
with many techniques to actually compute it has led to a wide, and sometimes
confusing, variety of identification methods. Our aim in this chapter, as well as in
Chapter 7, has been to point to underlying basic ideas.

For linear regression problems (LS and IV methods), we have recommended
QR factorization-type methods (10.11) and also pointed to the possibilities of using
Levinson and/or lattice methods [(10.23) and (10.30), (10.32) respectively] for special
structures.

For general PEM, we have recommended the damped Gauss-Newton iterative
method (10.40), (10.41), and (10.46) as the basic choice, complemented with (10.79)
to find initial values for linear black-box models.

For subspace methods the essential parts of the numerical calculations consist
of a QR-factorization step and an SVD. This allows very robust numerical methods
for the calculation of the estimates.


10.8 BIBLIOGRAPHY


The computation of estimates is of course a topic that is covered in many articles and
books on system identification. The basic techniques are also the subject of many
studies in numerical analysis.

For the linear least-squares problem of Section 10.1, an excellent overview is
given in Lawson and Hanson (1974). An account of the Levinson algorithm and its
ramifications is given, for example, in Kailath (1974). An early application of the
Levinson algorithm for estimation problems was given by Durbin (1960). The algorithm
with estimated $R(\tau)$ is therefore sometimes referred to as the Levinson-Durbin
algorithm. The multivariable Levinson algorithm was given in Whittle (1963) and
Wiggins and Robinson (1965). A Levinson algorithm for the "covariance method"
[for which $R(N)$ is not a Toeplitz matrix, but deviates "slightly" from this structure]
was derived by Morf et al. (1977). The Levinson algorithm has been widely applied in
geophysics (e.g., Robinson, 1967; Burg, 1967) and speech processing (e.g., Markel
and Gray, 1976), while it has been less used in control applications. Its numerical
properties are investigated in Cybenko (1980).

Lattice filters are used extensively in Markel and Gray (1976), Honig and
Messerschmitt (1984), and Rabiner and Schafer (1978), in addition to the references
mentioned in the text. The numerical stability of the calculation of reflection coefficients
in (10.30) and (10.32) has been analyzed by Cybenko (1984). Considerable
attention has been paid to the recursive updating of the reflection coefficients, which
we shall return to in Chapter 11.

For the methods of Section 10.2, Dennis and Schnabel (1983) serves as a basic
reference. It contains many additional references and also pseudocode for typical
algorithms. Variants of the Newton methods for system identification applications
have been discussed in, for example, Åström and Bohlin (1965), Gupta and Mehra
(1974), Kashyap and Nasburg (1974). The gradients $\psi = (\partial/\partial\theta)\hat{y}$, $(\partial/\partial\theta)x$, and
so on, are known as sensitivity functions or sensitivity derivatives. These have been
studied also in connection with sensitivity analysis of control design. Simple methods
for calculating these gradients in state-space models have been discussed in,
for example, Denery (1971) and Neuman and Sood (1972), as well as in Gupta and
Mehra (1974) and Hill (1985). The use of Lagrangian multipliers to reduce the computational
burden is described in Kashyap (1970) and van Zee and Bosgra (1982).
Another possibility is to apply Parseval's relationship to $V_N'$ and $H_N$ in (10.39) and
(10.45) and evaluate these in the frequency-domain in terms of the Fourier transforms
of the signals. See Hannan (1969) and Akaike (1973). The expressions follow
easily from (7.25) and (9.53). A special technique for maximizing the likelihood
function, the EM algorithm, has been developed by Dempster, Laird, and Rubin
(1977). See Problem 10G.3.

Further results on the uniqueness of solutions are given in Chapter 12 of Söderström
and Stoica (1989).

Analysis of bootstrap methods is carried out in Stoica et al. (1985). Two-
and multistage methods have been discussed in many variants. In addition to those
described in Section 10.4, there are, for example, the well-known methods of Durbin
(1959) and Walker (1961). Both start with a high-order AR model. The coefficients
of these models (the corresponding covariance functions in the Walker case) lead
to a system of equations for solving for MA parameters. See also Anderson (1971),
Section 5.7.2. Other techniques to build models by reduction of high order ARX-models
are described in Wahlberg (1989) and Zhu and Backx (1993).

The subspace methods really originate from classical realization theory as formulated
in Ho and Kalman (1966) and Kung (1978). These algorithms pointed to
the essential relationships (10.88) and (10.87). The extended observability matrix
can also be found by factorizing the Hankel matrix of impulse responses, and several
identification methods based on this have been devised, like King, Desai, and Skelton
(1988), Liu and Skelton (1992). Larimore developed his algorithms in Larimore
(1983), Larimore (1990), inspired by Akaike's work on canonical correlation, Akaike
(1974b) and Akaike (1976). Related algorithms were developed by Aoki in Aoki
(1987) for the time-series case.

The family of methods developed with the related approaches by Moonen
et al. (1989), Verhaegen (1991), lead to the basic presentations Verhaegen (1994) and
Van Overschee and DeMoor (1994). The relationships between the approaches
have been pointed out by Viberg (1995) and Van Overschee and DeMoor (1996),
which can be recommended as general overviews. Statistical analysis is presented in
Peternell, Scherrer, and Deistler (1996). Special techniques to handle closed loop
data are described in Chou and Verhaegen (1997). The presentation in Section 10.6 is
largely based on Viberg, Wahlberg, and Ottersten (1997), with a similar perspective
described in Jansson and Wahlberg (1996).

Subspace methods to fit frequency-domain data are treated in McKelvey, Akçay,
and Ljung (1996).


10.9 PROBLEMS 


10G.1 Let $z(t)$ be a $p$-dimensional signal, and let $a_k^n$ and $b_k^n$ ($p \times p$ matrices) be the least-
squares estimates of the linear regression models


$$\begin{aligned}
\hat{z}^n(t|\theta) &= -a_1^n z(t-1) - \cdots - a_n^n z(t-n) \\
\hat{z}^n(t-n-1|\theta) &= -b_1^n z(t-n) - \cdots - b_n^n z(t-1)
\end{aligned} \tag{10.130}$$
based on data $z(t)$, $1 \le t \le N$. Show, by arguments analogous to (10.15) to (10.24),
that these estimates can be computed as


$$\begin{aligned}
\hat{a}_k^{n+1} &= \hat{a}_k^n + \rho_n^a\hat{b}_{n-k+1}^n, & \hat{a}_{n+1}^{n+1} &= \rho_n^a \\
\hat{b}_k^{n+1} &= \hat{b}_k^n + \rho_n^b\hat{a}_{n-k+1}^n, & \hat{b}_{n+1}^{n+1} &= \rho_n^b \\
\alpha_n &= R_{n+1} + \sum_{k=1}^{n}\hat{a}_k^n R_{n+1-k}, & \beta_n &= R_{-n-1} + \sum_{k=1}^{n}\hat{b}_k^n R_{-n-1+k} \quad (= \alpha_n^T) \\
V_{n+1}^a &= V_n^a - \alpha_n[V_n^b]^{-1}\beta_n, & V_{n+1}^b &= V_n^b - \beta_n[V_n^a]^{-1}\alpha_n \\
\rho_n^a &= -\alpha_n[V_n^b]^{-1}, & \rho_n^b &= -\beta_n[V_n^a]^{-1}
\end{aligned}$$

Here
$$R_k = \frac{1}{N}\sum_{t=-n}^{N+n} z(t+k)z^T(t)$$


with $z(t) = 0$ outside the interval $1 \le t \le N$. (This is the multivariable Levinson
algorithm, as derived by Whittle, 1963.)
10G.2 Steiglitz-McBride method: Steiglitz and McBride (1965) have suggested the following
approach to identify a linear system subject to white measurement errors: Consider
the OE model (4.25)
$$y(t) = \frac{B(q)}{F(q)}u(t) + e(t)$$


Step 1. Apply the LS method to the ARX model
$$F(q)y(t) = B(q)u(t) + e(t)$$


This gives $\hat{B}_N(q)$ and $\hat{F}_N(q)$.
Step 2. Filter the data through the prefilter
$$y_F(t) = \frac{1}{\hat{F}_N(q)}y(t), \qquad u_F(t) = \frac{1}{\hat{F}_N(q)}u(t)$$


Step 3. Apply the LS method to the ARX model
$$F(q)y_F(t) = B(q)u_F(t) + e(t)$$
Repeat from step 2 with the new $\hat{F}$ estimate. Stop when $\hat{F}_N$ and $\hat{B}_N$ have converged.


(a) This method can be interpreted as a way of solving for a correlation estimate as
in (7.110) with a $\theta$-dependent prefilter. What is the correlation vector $\zeta(t, \theta)$
and what is the prefilter $L(q, \theta)$?


(b) By what numerical technique (according to the classification of this chapter) is
the estimate computed?


(c) Suppose the numerical scheme converges to a unique solution $\hat{B}_N$, $\hat{F}_N$ of the
correlation equation. Use Theorem 8.6 to discuss whether these estimates will
be consistent if the true system is described by


$$y(t) = \frac{B_0(q)}{F_0(q)}u(t) + v_0(t)$$
where $\{v_0(t)\}$ is white or colored noise, respectively. In case $\{v_0(t)\}$ is white,
what does Theorem 9.2 say about the asymptotic variance of the estimate? (See
also Stoica and Söderström, 1981a, for the analysis.)


10G.3 The EM algorithm: Consider Problem 7G.6 and the expression (7.161) for the negative
log likelihood function:


$$V(\theta) = -\log p(Y|\theta) = -\log p(Y, X|\theta) + \log p(X|\theta, Y)$$
This expression holds for all $X$, and can thus be integrated over any measure
for $X$, $f(X)$, without affecting the $X$-independent left side:


$$V(\theta) = -\int_{X \in R^n} \log p(Y, X|\theta)\,f(X)\,dX + \int_{X \in R^n} \log p(X|\theta, Y)\,f(X)\,dX$$


Let now in particular $f(X)$ be the conditional PDF for $X$, given $Y$ and assuming
$\theta = \alpha$:


$$f(X) = p(X|Y, \alpha)$$
Then
$$V(\theta) = H_1(Y, \theta, \alpha) + H_2(Y, \theta, \alpha)$$


$$H_1(Y, \theta, \alpha) = -\int_{X \in R^n} \log p(Y, X|\theta)\,p(X|Y, \alpha)\,dX = E\left(-\log p(Y, X|\theta)\,\middle|\,Y, \alpha\right)$$


$$H_2(Y, \theta, \alpha) = \int_{X \in R^n} \log p(X|Y, \theta) \cdot p(X|Y, \alpha)\,dX$$


The EM-algorithm (Dempster, Laird, and Rubin, 1977) for minimizing $V(\theta)$ consists
of the following steps:


1. Fix $\alpha_k$, and determine the conditional mean of $-\log p(Y, X|\theta)$ with respect
to $X$, given $Y$ under the assumption that the true value of $\theta$ is $\alpha_k$. This gives
$H_1(Y, \theta, \alpha_k)$. (Note that the $\theta$ in $p(Y, X|\theta)$ is left as a free variable.)


2. Minimize
$$H_1(Y, \theta, \alpha_k)$$
with respect to $\theta$ giving $\hat{\theta}_k$.
3. Set $\alpha_{k+1} = \hat{\theta}_k$ and repeat from 1.


(a) Now, show that
$$H_2(Y, \hat{\theta}_k, \alpha_k) \le H_2(Y, \alpha_k, \alpha_k)$$
and that hence
$$V(\hat{\theta}_k) \le V(\alpha_k)$$


The algorithm thus produces decreasing values of the negative log likelihood
function.


(b) Write out the EM-algorithm applied to the case of Problem 7G.6.


[Step 1 is the Estimation and step 2 the Minimization step in the EM-algorithm.
The algorithm is useful when the likelihood function given $Y$ is complicated and the addition
of some auxiliary measurements $X$ would have given a much simpler likelihood
function. We thus expand the problem with these fake measurements and average
over them using their conditional density given the actual observations and that the
system is described by the current $\theta$-estimate. Note that $H_2(Y, \theta, \alpha)$ is never formed
in the algorithm.]

10G.4 Let $\mathbf{Y}$ and $\mathbf{U}$ be defined by (10.103). Consider the following LQ-factorization
$$\begin{bmatrix} \mathbf{U} \\ \mathbf{Y} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix} \begin{bmatrix} Q_1^T \\ Q_2^T \end{bmatrix}$$


where
$$\begin{bmatrix} Q_1^T \\ Q_2^T \end{bmatrix} [\,Q_1 \;\; Q_2\,] = \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix}$$


Show that
$$\mathbf{Y}\Pi^{\perp}_{\mathbf{U}^T} = \mathbf{Y}\left(I - \mathbf{U}^T(\mathbf{U}\mathbf{U}^T)^{-1}\mathbf{U}\right) = L_{22}Q_2^T$$


Hint: Just plug in the factorized expressions for $\mathbf{U}$ and $\mathbf{Y}$.
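10G.5 Let $\mathbf{Y}$, $\mathbf{U}$ and $\Phi$ be defined by (10.103) and (10.108). Consider the LQ-factorization


$$\begin{bmatrix} \mathbf{U} \\ \Phi \\ \mathbf{Y} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 & 0 \\ L_{21} & L_{22} & 0 \\ L_{31} & L_{32} & L_{33} \end{bmatrix} \begin{bmatrix} Q_1^T \\ Q_2^T \\ Q_3^T \end{bmatrix}, \qquad Q^TQ = I, \quad Q = [\,Q_1 \;\; Q_2 \;\; Q_3\,]$$


Show that


(a) $\mathbf{Y}/_{\mathbf{U}}\Phi = \mathbf{Y}\Pi^{\perp}_{\mathbf{U}^T}\Phi^T\left(\Phi\Pi^{\perp}_{\mathbf{U}^T}\Phi^T\right)^{-1}\Phi = L_{32}L_{22}^{-1}[\,L_{21} \;\; L_{22}\,]\begin{bmatrix} Q_1^T \\ Q_2^T \end{bmatrix}$
(b) $\left(\mathbf{Y}/_{\mathbf{U}}\Phi\right)\Pi^{\perp}_{\mathbf{U}^T} = L_{32}Q_2^T$
Here the big 0-part of $L$ and the corresponding rows of $Q$ of the original LQ-
factorization have been thrown away. You may assume that indicated inverses exist.

Hint: Compare with the previous exercise.

(The notation $\mathbf{Y}/_{\mathbf{U}}\Phi$ for (10.120) is due to Van Overschee and DeMoor (1994) and
is to be read: "The oblique projection of $\mathbf{Y}$ onto the space $\Phi$, along the row space of
$\mathbf{U}$".)

Note that (a) corresponds to the N4SID choice of $\hat{G}$ according to (10.129).
Moreover (b) corresponds to the matrix $\hat{G}$ used in MOESP. Finally, note that according
to Problem 10E.10, you can always post-multiply the matrices with an orthonormal
matrix, without affecting the factor $U_1$ in the SVD (10.127). Hence we can work
entirely with the "L-parts" of the factorizations above, and throw away the (big) Q-
parts, when performing the calculations in (10.125).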


10G.6 Show that (10.115) has full rank $n$ provided
1. $E\varphi_s(t)\varphi_s^T(t)$ is positive definite.


2. $E\begin{bmatrix} x(t) \\ U_r(t) \end{bmatrix}[\,x^T(t) \;\; U_r^T(t)\,]$ is positive definite. This means that the $r$ future
inputs should not be linearly dependent on the current state.


3. $s_1$ and $s_2$ are sufficiently large so that $x(t) \approx L_s\varphi_s(t)$ for some $L_s$. (See
(10.123).)


Hint: Use that $Ex(t)\varphi_s^T(t) = E\hat{x}(t)\varphi_s^T(t)$, where $\hat{x}(t)$ is that part of the state that can
be reconstructed from past input-outputs. Similarly $Ex(t)U_r^T(t) = E\hat{x}(t)U_r^T(t)$.
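10E.1 In Hsia (1977), Section 6.7, the following identification procedure is described. Let
$$\varphi(t) = [\,-y(t-1) \;\ldots\; -y(t-n_a) \;\;\; u(t-1) \;\ldots\; u(t-n_b)\,]^T$$
$$\rho^T = [\,a_1 \ldots a_{n_a} \;\;\; b_1 \ldots b_{n_b}\,]$$
The model is then written as
$$y(t) = \rho^T\varphi(t) + e(t)$$


The estimate $\hat{\rho}_N$ is computed as a "bias correction"


$$\hat{\rho}_N = \hat{\rho}_N^{LS} - \hat{\rho}_N^{BIAS} \tag{10.131}$$
where $\hat{\rho}_N^{LS}$ and $\hat{\rho}_N^{BIAS}$ are computed iteratively as follows:
Step 1. Let
$$R_\varphi = \frac{1}{N}\sum_{t=1}^{N}\varphi(t)\varphi^T(t)$$
and
$$\hat{\rho}_N^{LS} = [R_\varphi]^{-1}\frac{1}{N}\sum_{t=1}^{N}\varphi(t)y(t), \qquad \hat{\rho}_N^{BIAS} = 0$$
Step 2. Let
$$\hat{\rho}_N = \hat{\rho}_N^{LS} - \hat{\rho}_N^{BIAS}$$
Step 3. Let
$$\varepsilon(t) = y(t) - \varphi^T(t)\hat{\rho}_N$$
and define
$$\varphi_\varepsilon(t) = [\,-\varepsilon(t-1) \;\ldots\; -\varepsilon(t-n_a)\,]^T$$
Let
$$R_\varepsilon = \frac{1}{N}\sum_{t=1}^{N}\varphi_\varepsilon(t)\varphi_\varepsilon^T(t), \qquad R_{\varphi\varepsilon} = \frac{1}{N}\sum_{t=1}^{N}\varphi(t)\varphi_\varepsilon^T(t)$$
Compute
$$D = R_\varepsilon - [R_{\varphi\varepsilon}]^T[R_\varphi]^{-1}[R_{\varphi\varepsilon}]$$
$$\hat{\rho}_N^{BIAS} = [R_\varphi]^{-1}R_{\varphi\varepsilon}D^{-1}\left\{\frac{1}{N}\sum_{t=1}^{N}\varphi_\varepsilon(t)y(t) - [R_{\varphi\varepsilon}]^T[R_\varphi]^{-1}\frac{1}{N}\sum_{t=1}^{N}\varphi(t)y(t)\right\}$$
and repeat from step 2 until convergence.
Now show that this procedure is the bootstrap algorithm (10.64) for the PLR
method (7.113) using the ARARX model (4.22).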


10E.2 Consider the model structure of Problem 5E.1. Give an expression for how to compute
the gradient


$$\psi(t, \theta) = \frac{d}{d\theta}\hat{y}(t|\theta)$$


10E.3 Apply the Gauss-Newton method (10.40) and (10.46) to the linear regression problem
(10.1) with quadratic criterion. Take $\mu = 1$.


10E.4 Introduce the approximation
$$\varepsilon(t, \theta) \approx \varepsilon(t, \hat{\theta}^{(i-1)}) - \psi^T(t, \hat{\theta}^{(i-1)})(\theta - \hat{\theta}^{(i-1)})$$


Use this approximation in
$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N}\frac{1}{2}\varepsilon^2(t, \theta) \tag{10.132}$$


and solve for the minimizing $\theta$. Show that this gives the (undamped) Gauss-Newton
method (10.40) and (10.46).


10E.5 Consider Problem 10E.4. Minimize (10.132) subject to the constraint
ae 
Discuss the relationship to the Levenberg-Marquardt method (10.47). 


10E.6 Let V, be defined by (10.23) and (10.24). Show that 


N 


N 
1 ; N N a. 
V, = rp ae HCl) = Le 


t=1 


with $” given by (10.15). 2 
10E.7 Consider the ARX model


$$y(t) + a_1y(t-1) + \cdots + a_ny(t-n) = b_1u(t-1) + \cdots + b_nu(t-n) + e(t)$$


Introduce
$$z(t) = \begin{bmatrix} -y(t) \\ u(t) \end{bmatrix}$$


and show how the estimates of $a_i$ and $b_i$ can be computed using the multivariable
Levinson algorithm of Problem 10G.1.


10E.8 Show that, in the lattice filter, we have
$$\hat{y}_n(t|\theta^n) = -\rho_1r_0(t-1) - \rho_2r_1(t-1) - \cdots - \rho_nr_{n-1}(t-1)$$


Compute the covariance matrix of $r_i$, $i = 1, \ldots, n$.


10E.9 Apply the method (10.68) to the ARARX model (10.66). Spell out the steps explicitly
and show that they consist of a sequence of LS problems mixed with simple filtering
operations. [This is the generalized least squares (GLS) method, developed by Clarke
(1967). See also Söderström (1974).]

10E.10 Let the $p \times N$ matrix $G$ be given, with $p < N$. Let its SVD be
$$G = USV^T$$
($U$ is $p \times p$ with $U^TU = I$, and $V$ is $N \times N$ with $V^TV = I$ and $S$ is a $p \times N$
matrix with the singular values along the diagonal, and zeros elsewhere.) Suppose
$p < r < N$ and $W$ is an $N \times r$ matrix such that $W^TW = I$. Let


$$GW = U_1S_1V_1^T$$


be the SVD of $GW$. Show that $U = U_1$. (Hint: Note that, with MATLAB notation,
S*V'=S(:,1:r)*V(:,1:r)'. Then use that W'*V(:,1:r) will be orthogonal.)


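10E.11 The block matrix inversion lemma says:
$$\begin{bmatrix} A & D \\ C & B \end{bmatrix}^{-1} = \begin{bmatrix} \Delta^{-1} & -\Delta^{-1}DB^{-1} \\ -B^{-1}C\Delta^{-1} & B^{-1} + B^{-1}C\Delta^{-1}DB^{-1} \end{bmatrix}$$


where
$$\Delta = A - DB^{-1}C$$


Apply this to the matrix in (10.118); show that $\Delta$ becomes $\Phi\Pi^{\perp}_{\mathbf{U}^T}\Phi^T$ and that (10.119)
holds.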


10T.1 Householder transformations: A Householder transformation is a matrix
$$Q = I - 2ww^T$$
where $w$ is a column vector with norm 1. Show the following


(a) $Q$ is symmetric and orthogonal.


(b) Let $x$ be an arbitrary vector. Then there exists a $Q$ as in part (a) such that


(c) Let $A$ be an $n \times m$ matrix, $n > m$. Then there exists an orthogonal matrix


$$Q = Q_mQ_{m-1} \cdots Q_1$$


being a product of Householder transformations such that


where $R$ is a square ($m \times m$) upper triangular matrix (see Lawson and Hanson,
1974).
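
A numerical check can make the construction concrete before attempting the proof. The sketch below builds one Householder matrix for a given non-zero vector, using the standard textbook choice of $w$ that reflects $x$ onto a multiple of the first unit vector. The function name, the test vector, and the sign convention are illustrative assumptions, not part of the problem statement.

```python
import math

def householder(x):
    """Return Q = I - 2 w w^T and alpha such that Q x = alpha * e1 (x must be non-zero)."""
    norm_x = math.sqrt(sum(v * v for v in x))
    alpha = -math.copysign(norm_x, x[0])      # sign chosen to avoid cancellation in w
    w = list(x)
    w[0] -= alpha
    norm_w = math.sqrt(sum(v * v for v in w))
    w = [v / norm_w for v in w]               # unit-norm vector defining the reflection
    n = len(x)
    Q = [[(1.0 if i == j else 0.0) - 2.0 * w[i] * w[j] for j in range(n)]
         for i in range(n)]
    return Q, alpha

x = [3.0, 4.0, 12.0]                          # hypothetical vector with norm 13
Q, alpha = householder(x)
Qx = [sum(Q[i][j] * x[j] for j in range(3)) for i in range(3)]
QQt = [[sum(Q[i][k] * Q[j][k] for k in range(3)) for j in range(3)] for i in range(3)]
```

Applying such reflections column by column to a tall matrix is the usual route to the triangular factor in part (c).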
10T.2 Consider the system
$$A_0(q)y(t) = B_0(q)u(t) + C_0(q)e_0(t)$$
and an ARMAX model structure
$$\theta = [\,a_1 \ldots a_{n_a} \;\; b_1 \ldots b_{n_b} \;\; c_1 \ldots c_{n_c}\,]^T$$


$$A(q)y(t) = B(q)u(t) + C(q)e(t)$$
with polynomial orders larger or equal to those of the true system. Let
e Ce”) 
Da = 40 IR = OV: 
M | ! Tew) > of 


Show that the prediction-error criterion
$$\bar{V}(\theta) = E\varepsilon^2(t, \theta)$$
has no false local minimum in $\theta \in D_{\mathcal{M}}$; that is,


Veni, ee oe) 2 oo) 
Alq) — Ao(q) Aq) — Ao(q) 
Consider the method (10.68) to minimize (10.67) for a bilinear parametrization. Write 
(10.68) as an update step (10.40) with a block-diagonal Ry matrix. It is thus indeed 
a descent method that will converge to a local minimum. 


10D.1 Verify the relationships (10.31). Hint: By definition,


$$\sum_{t=1}^{N} e_n(t)y(t-k) = 0, \qquad 1 \le k \le n$$


[the residuals are uncorrelated with the regressors: see Figure II.1].


10D.2 Use Problem 10G.1 to derive a lattice filter for a multivariable signal z(t). 


11 


RECURSIVE ESTIMATION 
METHODS 


11.1 INTRODUCTION 


In many cases it is necessary, or useful, to have a model of the system available on-line 
while the system is in operation. The model should then be based on observations 
up to the current time. The need for such an on-line model construction typically 
arises since a model is required in order to take some decision about the system. This 
could be 

• Which input should be applied at the next sampling instant?

• How should the parameters of a matched filter be tuned?

• What are the best predictions of the next few outputs?

• Has a failure occurred and, if so, of what type?

Methods coping with such problems using an on-line adjusted model of some sort
are usually called adaptive (see Figure 11.1). We thus talk about adaptive control,
adaptive filtering, adaptive signal processing, and adaptive prediction.


[Figure 11.1 Adaptive methods.]
The on-line computation of the model must also be done in such a way that the
processing of the measurements from one sample can, with certainty, be completed
during one sampling interval. Otherwise the model building cannot keep up with
the information flow.


Identification techniques that comply with this requirement will here be called 
recursive identification methods, since the measured input-output data are processed 
recursively (sequentially) as they become available. Other commonly used terms 
for such techniques are on-line or real-time identification, adaptive parameter
estimation, or sequential parameter estimation. Apart from the use of recursive
methods in adaptive schemes, they are of importance also for the following two
reasons:


1. Typically, as we shall see, they will carry their own estimate of the parameter
variance. This means that data can be collected from the system and processed
until a sufficient degree of model accuracy has been reached.


2. The algorithms of this chapter will also turn out to be quite competitive
alternatives for parameter estimation in off-line situations. See Section 10.3.


In this chapter we shall discuss how recursive identification algorithms can be
constructed, what their properties are, and how to deal with some practical issues. We
start by formally describing the requirement of finite-time computability.


Algorithm Format
We defined a general identification method as a mapping from the data set $Z^t$ to the
parameter space in (7.7):

$$\hat\theta_t = F(t, Z^t) \tag{11.1}$$


where the function $F$ may be implicitly defined (e.g., as the minimizing argument
of some function). Such a general expression (11.1) cannot be used in a recursive
algorithm, since the evaluation of $F$ may involve an unforeseen amount of
calculations, which perhaps may not be terminated at the next sampling instant. Instead, a
recursive algorithm must comply with the following format:


$$\begin{aligned}
X(t) &= H(t, X(t-1), y(t), u(t)) \\
\hat\theta_t &= h(X(t))
\end{aligned} \tag{11.2}$$


Here $X(t)$ is a vector of fixed dimension that represents some “information state.”
The functions $H$ and $h$ are explicit expressions that can be evaluated with a fixed
and a priori known amount of calculations. In that way it can be secured that $\hat\theta_t$ can
be evaluated during a sampling interval.
Since the information content in the latest pair of measurements, $y(t)$, $u(t)$,
normally is small compared to the information already accumulated from previous
measurements, the algorithm (11.2) typically takes a more specific form:


$$\begin{aligned}
\hat\theta_t &= \hat\theta_{t-1} + \gamma\, Q_\theta(X(t), y(t), u(t)) \\
X(t) &= X(t-1) + \mu\, Q_X(X(t-1), y(t), u(t))
\end{aligned} \tag{11.3}$$


where $\gamma$ and $\mu$ are small numbers reflecting the relative information value in the
latest measurement.
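
As a minimal illustration of the format (11.2)-(11.3), the running mean of a scalar signal can be computed with a fixed amount of work per sample. The sketch below is not taken from the text; the function name `recursive_mean` and the sample data are assumptions made for the example. The information state consists only of the current estimate and the sample counter, and the gain is $\gamma = 1/t$.

```python
def recursive_mean(samples):
    """Running mean in the form theta_t = theta_{t-1} + gamma_t * (y_t - theta_{t-1})."""
    theta = 0.0                      # current estimate (part of the information state)
    estimates = []
    for t, y in enumerate(samples, start=1):
        gamma = 1.0 / t              # relative information value of the newest sample
        theta = theta + gamma * (y - theta)
        estimates.append(theta)
    return estimates


data = [2.0, 4.0, 6.0, 8.0]          # illustrative data
running = recursive_mean(data)
```

Each update uses only the previous estimate and the new sample, so the cost per sampling instant is known beforehand, which is the point of the requirement (11.2).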


11.2 THE RECURSIVE LEAST-SQUARES ALGORITHM 


In this section we shall consider the least-squares method as a simple but archetypal 
case. The obtained algorithms and insights will then also serve as a preview of the 
following sections. 


Weighted LS Criterion 


In Section 7.3 we computed the estimate that minimizes the weighted least-squares
criterion:

$$\hat\theta_t = \arg\min_\theta \sum_{k=1}^{t} \beta(t,k)\left[y(k) - \varphi^T(k)\theta\right]^2 \tag{11.4}$$

This is given by (7.41):

$$\hat\theta_t = \bar R^{-1}(t) f(t) \tag{11.5a}$$

$$\bar R(t) = \sum_{k=1}^{t} \beta(t,k)\varphi(k)\varphi^T(k) \tag{11.5b}$$

$$f(t) = \sum_{k=1}^{t} \beta(t,k)\varphi(k)y(k) \tag{11.5c}$$


To compute (11.5) as given, we would at time $t$ form the indicated matrix and
vector from $Z^t$ and then solve (11.5a). If we had computed $\hat\theta_{t-1}$ previously, this
would not be of any immediate help. However, it is clear from the expressions that
$\hat\theta_t$ and $\hat\theta_{t-1}$ are closely related. Let us therefore try to utilize this relationship.
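Recursive Algorithm


Suppose that the weighting sequence has the following property:


$$\begin{aligned}
\beta(t,k) &= \lambda(t)\beta(t-1,k), \qquad 0 \le k \le t-1 \\
\beta(t,t) &= 1
\end{aligned} \tag{11.6}$$


This means that we may write


$$\beta(t,k) = \prod_{j=k+1}^{t} \lambda(j) \tag{11.7}$$


We shall later discuss the significance of this assumption. We note, though, that it
implies that


$$\bar R(t) = \lambda(t)\bar R(t-1) + \varphi(t)\varphi^T(t) \tag{11.8a}$$

$$f(t) = \lambda(t) f(t-1) + \varphi(t)y(t) \tag{11.8b}$$

Now

$$\begin{aligned}
\hat\theta_t = \bar R^{-1}(t) f(t) &= \bar R^{-1}(t)\left[\lambda(t) f(t-1) + \varphi(t)y(t)\right] \\
&= \bar R^{-1}(t)\left[\lambda(t)\bar R(t-1)\hat\theta_{t-1} + \varphi(t)y(t)\right] \\
&= \bar R^{-1}(t)\left\{\left[\bar R(t) - \varphi(t)\varphi^T(t)\right]\hat\theta_{t-1} + \varphi(t)y(t)\right\} \\
&= \hat\theta_{t-1} + \bar R^{-1}(t)\varphi(t)\left[y(t) - \varphi^T(t)\hat\theta_{t-1}\right]
\end{aligned}$$

We thus have

$$\hat\theta_t = \hat\theta_{t-1} + \bar R^{-1}(t)\varphi(t)\left[y(t) - \varphi^T(t)\hat\theta_{t-1}\right] \tag{11.9a}$$

$$\bar R(t) = \lambda(t)\bar R(t-1) + \varphi(t)\varphi^T(t) \tag{11.9b}$$


which is a recursive algorithm, complying with the requirement (11.2): At time $t-1$
we store only the finite-dimensional information vector $X(t-1) = [\hat\theta_{t-1}, \bar R(t-1)]$.
Since $\bar R$ is symmetric, the dimension of $X$ is $d + d(d+1)/2$. At time $t$ this vector
is updated using (11.9), which is done with a given, fixed amount of operations.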
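Version with Efficient Matrix Inversion


To avoid inverting $\bar R(t)$ at each step, it is convenient to introduce
$$P(t) = \bar R^{-1}(t)$$
and apply the matrix inversion lemma
$$[A + BCD]^{-1} = A^{-1} - A^{-1}B\left[DA^{-1}B + C^{-1}\right]^{-1}DA^{-1} \tag{11.10}$$
to (11.9b). Taking $A = \lambda(t)\bar R(t-1)$, $B = D^T = \varphi(t)$, and $C = 1$ gives


$$P(t) = \frac{1}{\lambda(t)}\left[P(t-1) - \frac{P(t-1)\varphi(t)\varphi^T(t)P(t-1)}{\lambda(t) + \varphi^T(t)P(t-1)\varphi(t)}\right] \tag{11.11}$$

Moreover, we have

$$\begin{aligned}
\bar R^{-1}(t)\varphi(t) &= \frac{1}{\lambda(t)}P(t-1)\varphi(t) - \frac{1}{\lambda(t)}\,\frac{P(t-1)\varphi(t)\varphi^T(t)P(t-1)\varphi(t)}{\lambda(t) + \varphi^T(t)P(t-1)\varphi(t)} \\
&= \frac{P(t-1)\varphi(t)}{\lambda(t) + \varphi^T(t)P(t-1)\varphi(t)}
\end{aligned}$$

We can thus summarize this version of the algorithm as


$$\hat\theta(t) = \hat\theta(t-1) + L(t)\left[y(t) - \varphi^T(t)\hat\theta(t-1)\right] \tag{11.12a}$$

$$L(t) = \frac{P(t-1)\varphi(t)}{\lambda(t) + \varphi^T(t)P(t-1)\varphi(t)} \tag{11.12b}$$

$$P(t) = \frac{1}{\lambda(t)}\left[P(t-1) - \frac{P(t-1)\varphi(t)\varphi^T(t)P(t-1)}{\lambda(t) + \varphi^T(t)P(t-1)\varphi(t)}\right] \tag{11.12c}$$


Here we switched to the notation $\hat\theta(t)$ rather than $\hat\theta_t$ to account for certain differences
due to initial conditions (see the following).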
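
The update (11.12) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a reference implementation: the function names `rls_step` and `run_rls`, the start-up value `p0`, and the noise-free test signal are assumptions made for the example.

```python
import numpy as np


def rls_step(theta, P, phi, y, lam=1.0):
    """One update of (11.12) for a scalar output; returns the new estimate and P."""
    Pphi = P @ phi
    denom = lam + phi @ Pphi                      # lambda(t) + phi^T P(t-1) phi
    L = Pphi / denom                              # gain vector (11.12b)
    eps = y - phi @ theta                         # prediction error
    theta_new = theta + L * eps                   # (11.12a)
    # P is symmetric, so P phi phi^T P equals the outer product of P phi with itself
    P_new = (P - np.outer(Pphi, Pphi) / denom) / lam   # (11.12c)
    return theta_new, P_new


def run_rls(Phi, Y, lam=1.0, p0=1e6):
    """Run RLS over all rows of Phi, starting from theta(0) = 0 and P(0) = p0 * I."""
    d = Phi.shape[1]
    theta, P = np.zeros(d), p0 * np.eye(d)
    for phi, y in zip(Phi, Y):
        theta, P = rls_step(theta, P, phi, y, lam)
    return theta, P


# Hypothetical noise-free data y(t) = phi(t)^T theta0
theta0 = np.array([0.5, -1.5])
t = np.arange(1, 51)
Phi = np.column_stack([np.sin(0.9 * t), np.cos(0.4 * t)])
Y = Phi @ theta0
theta_hat, P_final = run_rls(Phi, Y)
```

No matrix is inverted inside the loop; the only division is by a scalar, which is what makes this version attractive compared with solving (11.9a) directly.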


Normalized Gain Version


The “size” of the matrix $\bar R(t)$ in (11.5b) and (11.9) will depend on the $\lambda(t)$. To clearly
bring out the amount of modification inflicted on $\hat\theta_{t-1}$ in (11.9a), it is instructive to
normalize $\bar R(t)$ so that


$$R(t) = \gamma(t)\bar R(t), \qquad \gamma(t) = \left[\sum_{k=1}^{t}\beta(t,k)\right]^{-1} \tag{11.13}$$


Notice that


$$\frac{1}{\gamma(t)} = \frac{\lambda(t)}{\gamma(t-1)} + 1 \tag{11.14}$$


according to (11.6), and that $R(t)$ now is a weighted arithmetic mean of $\varphi(k)\varphi^T(k)$.
From (11.9b) and (11.14),


$$\begin{aligned}
R(t) &= \gamma(t)\left[\lambda(t)\frac{1}{\gamma(t-1)}R(t-1) + \varphi(t)\varphi^T(t)\right] \\
&= R(t-1) + \gamma(t)\left[\varphi(t)\varphi^T(t) - R(t-1)\right]
\end{aligned} \tag{11.15}$$


Then (11.9) can be rewritten as

$$\begin{aligned}
\varepsilon(t) &= y(t) - \varphi^T(t)\hat\theta(t-1) \\
\hat\theta(t) &= \hat\theta(t-1) + \gamma(t)R^{-1}(t)\varphi(t)\varepsilon(t) \\
R(t) &= R(t-1) + \gamma(t)\left[\varphi(t)\varphi^T(t) - R(t-1)\right]
\end{aligned} \tag{11.16}$$


Notice that $\varepsilon(t)$ is the prediction error according to the current model. Since $R(t)$
is a normalized matrix, the variable $\gamma(t)$ can be viewed as an updating step size or
gain in the algorithm (11.16). Compare also with (11.3).
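
The equivalence between the normalized form (11.16) and the direct form (11.9) can be checked numerically. The sketch below is illustrative only: the start-up values `r0` (for the matrix) and $\gamma(0) = 1$ are assumptions chosen so that both recursions start from the same matrix, and the test signal with its deterministic disturbance is made up for the example.

```python
import numpy as np


def rls_normalized(Phi, Y, lam=0.95, r0=1e-3):
    """Normalized-gain RLS (11.16), with gamma(t) generated by (11.14)."""
    d = Phi.shape[1]
    theta = np.zeros(d)
    R = r0 * np.eye(d)        # assumed start-up value R(0)
    gamma = 1.0               # assumed gamma(0) = 1, so that R(0) equals Rbar(0)
    for phi, y in zip(Phi, Y):
        gamma = gamma / (lam + gamma)            # 1/gamma(t) = lam/gamma(t-1) + 1
        eps = y - phi @ theta                    # prediction error with theta(t-1)
        R = R + gamma * (np.outer(phi, phi) - R)
        theta = theta + gamma * np.linalg.solve(R, phi) * eps
    return theta


def rls_direct(Phi, Y, lam=0.95, r0=1e-3):
    """Direct form (11.9) with Rbar(t) = lam * Rbar(t-1) + phi phi^T."""
    d = Phi.shape[1]
    theta = np.zeros(d)
    Rbar = r0 * np.eye(d)
    for phi, y in zip(Phi, Y):
        Rbar = lam * Rbar + np.outer(phi, phi)
        theta = theta + np.linalg.solve(Rbar, phi) * (y - phi @ theta)
    return theta


t = np.arange(1, 61)
Phi = np.column_stack([np.sin(0.9 * t), np.cos(0.4 * t)])
theta0 = np.array([0.5, -1.5])
Y = Phi @ theta0 + 0.1 * np.sin(2.3 * t)   # deterministic disturbance, for illustration
theta_norm = rls_normalized(Phi, Y)
theta_direct = rls_direct(Phi, Y)
```

Both routines produce the same estimates up to rounding, since $\gamma(t)R^{-1}(t) = \bar R^{-1}(t)$ at every step; the normalized form merely separates the step size $\gamma(t)$ from the direction $R^{-1}(t)\varphi(t)$.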


Initial Conditions

To use the recursive algorithms, initial values for their start-up are required. The
correct initial conditions in (11.9) at time $t = 0$ would be $\bar R(0) = 0$, $\hat\theta_0$ arbitrary,
according to the definition of $\bar R$. These cannot, however, be used. A possibility
could then be to initialize only at a time instant $t_0$ when $\bar R(t_0)$ has become invertible
(typically $t_0 \ge d$) and then use


$$P(t_0) = \bar R^{-1}(t_0) = \left[\sum_{k=1}^{t_0}\beta(t_0,k)\varphi(k)\varphi^T(k)\right]^{-1} \tag{11.17}$$

$$\hat\theta(t_0) = P(t_0)\sum_{k=1}^{t_0}\beta(t_0,k)\varphi(k)y(k) \tag{11.18}$$


A simpler alternative is, however, to use $P(0) = P_0$ and $\hat\theta(0) = \theta_I$ in (11.12). This
gives

$$\hat\theta(t) = \left[\beta(t,0)P_0^{-1} + \sum_{k=1}^{t}\beta(t,k)\varphi(k)\varphi^T(k)\right]^{-1}\left[\beta(t,0)P_0^{-1}\theta_I + \sum_{k=1}^{t}\beta(t,k)\varphi(k)y(k)\right] \tag{11.19}$$


where $\beta(t,0)$ is defined by (11.7). Clearly, if $P_0$ is large or $t$ is large, then the
difference between (11.19) and (11.5) is insignificant.


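Multivariable Case (*)
Consider now the weighted multivariable case [cf. (7.42) and (7.43)]


$$\hat\theta_t = \arg\min_\theta \frac{1}{2}\sum_{k=1}^{t}\beta(t,k)\left[y(k) - \varphi^T(k)\theta\right]^T\Lambda_k^{-1}\left[y(k) - \varphi^T(k)\theta\right] \tag{11.20}$$


where $\beta(t,k)$ is subject to (11.6). Entirely analogous calculations as before give the
multivariable counterpart of (11.12),

$$\begin{aligned}
\hat\theta(t) &= \hat\theta(t-1) + L(t)\left[y(t) - \varphi^T(t)\hat\theta(t-1)\right] \\
L(t) &= P(t-1)\varphi(t)\left[\lambda(t)\Lambda_t + \varphi^T(t)P(t-1)\varphi(t)\right]^{-1} \\
P(t) &= \frac{1}{\lambda(t)}\left\{P(t-1) - P(t-1)\varphi(t)\left[\lambda(t)\Lambda_t + \varphi^T(t)P(t-1)\varphi(t)\right]^{-1}\varphi^T(t)P(t-1)\right\}
\end{aligned} \tag{11.21}$$


and of (11.16):

$$\begin{aligned}
\varepsilon(t) &= y(t) - \varphi^T(t)\hat\theta(t-1) \\
\hat\theta(t) &= \hat\theta(t-1) + \gamma(t)R^{-1}(t)\varphi(t)\Lambda_t^{-1}\varepsilon(t) \\
R(t) &= R(t-1) + \gamma(t)\left[\varphi(t)\Lambda_t^{-1}\varphi^T(t) - R(t-1)\right]
\end{aligned} \tag{11.22}$$


Notice that these expressions are useful also for a scalar system when a weighted
norm with


$$\beta(t,k) = \alpha_k\prod_{j=k+1}^{t}\lambda(j) \tag{11.23}$$


is used in (11.4). The scalar $\alpha_k$ then corresponds to $\Lambda_k^{-1}$.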
Asymptotic Properties of the Estimate


Since $\hat\theta(t)$ computed using recursive least squares (RLS) differs from the off-line
counterpart at most by the initial effects, as shown in (11.19), the asymptotic
properties will coincide with those discussed in Chapters 8 and 9.

Kalman Filter Interpretation

The Kalman filter for estimating the state of the system


$$\begin{aligned}
x(t+1) &= F(t)x(t) + w(t) \\
y(t) &= H(t)x(t) + v(t)
\end{aligned} \tag{11.24}$$

is given by (4.94) and (4.95). The linear regression model
$$\hat y(t|\theta) = \varphi^T(t)\theta$$
underlying our calculations can be cast into the form (11.24) by

$$\begin{aligned}
\theta(t+1) &= \theta(t), \quad (= \theta) \\
y(t) &= \varphi^T(t)\theta(t) + v(t)
\end{aligned} \tag{11.25}$$


Applying the Kalman filter to (11.25) with $F(t) = I$, $H(t) = \varphi^T(t)$, $R_1(t) = 0$
[$= Ew(t)w^T(t)$], and $Ev(t)v^T(t) = R_2(t)$ now gives exactly (11.21), with $\lambda(t) = 1$
and $\Lambda_t = R_2(t)$.
This gives important information, as well as some practical hints:


1. If the noise $v(t)$ in (11.25) is white and Gaussian, then the Kalman filter theory
tells us that the posterior distribution of $\theta(t)$, given $Z^{t-1}$, is Gaussian with
mean value $\hat\theta(t)$ and covariance matrix $P(t)$, given by (11.21), with $\lambda(t) = 1$
and $\Lambda_t = R_2(t)$.


2. Moreover, the initial conditions can be interpreted so that $\hat\theta(0)$ is the mean
and $P(0)$ is the covariance matrix of the prior distribution. In plain words, this
means that $\hat\theta(0)$ is what we guess the parameter vector to be before we have
seen the data, and $P(0)$ reflects our confidence in this guess.


3. In addition, the natural choice of the norm $\Lambda_t$ in the multivariable case is to
let it equal the equation error noise covariance matrix. If, in the scalar case,
$\alpha_t^{-1} = Ev^2(t)$ is time varying, we should use $\beta(k,k) = \alpha_k$ in the weighted
criterion (11.4) [cf. (11.23)].


Coping with Time-varying Systems


An important reason for using adaptive methods and recursive identification in
practice is that the properties of the system may be time varying and that we want the
identification algorithm to track the variations. This is handled in a natural way in
the weighted criterion (11.4) by assigning less weight to older measurements that
are no longer representative for the system. This means, in terms of (11.6), that we
choose $\lambda(j) < 1$. In particular, if $\lambda(j) \equiv \lambda$, then


$$\beta(t,k) = \lambda^{t-k} \tag{11.26}$$

and old measurements in the criterion are exponentially discounted. In that case, $\lambda$
is often called the forgetting factor. The corresponding $\gamma(t)$ will then, according to
(11.14), be


$$\gamma(t) \equiv \gamma = 1 - \lambda \tag{11.27}$$


These choices have the natural effect on the algorithm (11.12) or (11.16) that the step
size or gain will not decrease to zero. This issue is discussed in more detail in Section
11.6.


Another and more formal alternative to deal with time-varying parameters is
to postulate that the true parameter vector in (11.25) is not constant, but varies like
a random walk:


$$\theta(t+1) = \theta(t) + w(t), \qquad Ew(t)w^T(t) = R_1(t) \tag{11.28}$$


with $w$ white Gaussian and $Ev^2(t) = R_2(t)$. The Kalman filter then still gives the
conditional expectation and covariance of $\theta(t)$ as
$$\hat\theta(t) = \hat\theta(t-1) + L(t)\left[y(t) - \varphi^T(t)\hat\theta(t-1)\right] \tag{11.29a}$$

$$L(t) = \frac{P(t-1)\varphi(t)}{R_2(t) + \varphi^T(t)P(t-1)\varphi(t)} \tag{11.29b}$$

$$P(t) = P(t-1) - \frac{P(t-1)\varphi(t)\varphi^T(t)P(t-1)}{R_2(t) + \varphi^T(t)P(t-1)\varphi(t)} + R_1(t) \tag{11.29c}$$


We see in this formulation that it is the additive $R_1(t)$ term in (11.29c) that prevents
the gain $L(t)$ from tending to zero.
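
The effect of the additive $R_1(t)$ term can be seen in a small numerical sketch of (11.29). The code is illustrative only: the function names, the start-up value `p0`, the constant choices of $R_1$ and $R_2 = 1$, and the noise-free test signal are assumptions made for the example.

```python
import numpy as np


def kalman_step(theta, P, phi, y, R1, r2=1.0):
    """One step of (11.29) for a scalar output y(t) = phi^T theta(t) + v(t)."""
    Pphi = P @ phi
    denom = r2 + phi @ Pphi                           # R2(t) + phi^T P(t-1) phi
    L = Pphi / denom                                  # gain (11.29b)
    theta_new = theta + L * (y - phi @ theta)         # (11.29a)
    P_new = P - np.outer(Pphi, Pphi) / denom + R1     # (11.29c)
    return theta_new, P_new, L


def run_filter(Phi, Y, R1, p0=100.0):
    d = Phi.shape[1]
    theta, P, L = np.zeros(d), p0 * np.eye(d), np.zeros(d)
    for phi, y in zip(Phi, Y):
        theta, P, L = kalman_step(theta, P, phi, y, R1)
    return theta, P, L


t = np.arange(1, 501)
Phi = np.column_stack([np.sin(0.9 * t), np.cos(0.4 * t)])
theta0 = np.array([0.5, -1.5])
Y = Phi @ theta0                                      # noise-free, constant parameters

theta_c, P_c, L_c = run_filter(Phi, Y, R1=np.zeros((2, 2)))   # no assumed drift: plain RLS
theta_d, P_d, L_d = run_filter(Phi, Y, R1=1e-2 * np.eye(2))   # assumed random-walk drift
```

With $R_1 = 0$ the matrix $P(t)$, and with it the gain, shrinks roughly like $1/t$; with $R_1 > 0$ both settle at nonzero levels, so the filter stays alert to parameter changes at the price of higher noise sensitivity.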


11.3 THE RECURSIVE IV METHOD 


The IV estimate for fixed (not model dependent) instruments is given by (7.118).
Including weights as in (11.5) gives


$$\hat\theta_t^{IV} = \bar R^{-1}(t) f(t) \tag{11.30}$$


with


$$\bar R(t) = \sum_{k=1}^{t}\beta(t,k)\zeta(k)\varphi^T(k), \qquad f(t) = \sum_{k=1}^{t}\beta(t,k)\zeta(k)y(k) \tag{11.31}$$


This is closely related to the formulation (11.5), and the recursive computation of
$\hat\theta_t^{IV}$ is quite analogous to that of $\hat\theta_t^{LS}$. The counterpart of (11.12) is


$$\hat\theta(t) = \hat\theta(t-1) + L(t)\left[y(t) - \varphi^T(t)\hat\theta(t-1)\right] \tag{11.32a}$$

$$L(t) = \frac{P(t-1)\zeta(t)}{\lambda(t) + \varphi^T(t)P(t-1)\zeta(t)} \tag{11.32b}$$

$$P(t) = \frac{1}{\lambda(t)}\left[P(t-1) - \frac{P(t-1)\zeta(t)\varphi^T(t)P(t-1)}{\lambda(t) + \varphi^T(t)P(t-1)\zeta(t)}\right] \tag{11.32c}$$

Asymptotic Properties


Apart from possible initial-value effects, $\hat\theta(t)$ computed as in (11.32) coincides with
its off-line counterpart (11.30). Hence its asymptotic properties are given by the
analysis in Chapters 8 and 9.
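
A sketch of the recursion (11.32) for a first-order model is given below. It is only an illustration under stated assumptions: the instruments are taken as delayed inputs, $\zeta(t) = [u(t-2)\;\; u(t-1)]^T$, the data are noise-free so that the result can be checked exactly, and the names `riv_step`, `a0`, `b0` are made up for the example. Note that $P(t)$ is not symmetric here, so the two factors $P\zeta$ and $\varphi^T P$ are kept separate.

```python
import numpy as np


def riv_step(theta, P, phi, zeta, y, lam=1.0):
    """One update of (11.32) with regressor phi and instrument vector zeta."""
    Pz = P @ zeta                                  # P(t-1) zeta(t)
    phiP = phi @ P                                 # phi^T(t) P(t-1)
    denom = lam + phiP @ zeta
    theta_new = theta + Pz / denom * (y - phi @ theta)    # (11.32a), (11.32b)
    P_new = (P - np.outer(Pz, phiP) / denom) / lam        # (11.32c)
    return theta_new, P_new


rng = np.random.default_rng(0)
N = 300
u = rng.standard_normal(N)
a0, b0 = -0.7, 1.0                 # hypothetical system: y(t) + a0*y(t-1) = b0*u(t-1)
y = np.zeros(N)
for t in range(1, N):
    y[t] = -a0 * y[t - 1] + b0 * u[t - 1]

theta_iv = np.zeros(2)             # estimates of [a, b]
P = 1e3 * np.eye(2)
for t in range(2, N):
    phi = np.array([-y[t - 1], u[t - 1]])
    zeta = np.array([u[t - 2], u[t - 1]])          # assumed instruments: delayed inputs
    theta_iv, P = riv_step(theta_iv, P, phi, zeta, y[t])
```

With noisy data and correlated equation errors the benefit of the instruments is that they are uncorrelated with the noise, which the plain least-squares regressors are not; the noise-free run above only checks the mechanics of the recursion.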


11.4 RECURSIVE PREDICTION-ERROR METHODS 


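Analogous to the weighted LS case, let us consider a weighted quadratic
prediction-error criterion


$$V_t(\theta, Z^t) = \gamma(t)\,\frac{1}{2}\sum_{k=1}^{t}\beta(t,k)\varepsilon^2(k,\theta) \tag{11.33}$$

with $\beta$ and $\gamma$ given by (11.6) and (11.13). Note that


$$\gamma(t)\sum_{k=1}^{t}\beta(t,k) = 1$$

and that the gradient w.r.t. $\theta$ obeys


$$\begin{aligned}
V_t'(\theta, Z^t) &= -\gamma(t)\sum_{k=1}^{t}\beta(t,k)\psi(k,\theta)\varepsilon(k,\theta) \\
&= \gamma(t)\left[\lambda(t)\frac{1}{\gamma(t-1)}V_{t-1}'(\theta, Z^{t-1}) - \psi(t,\theta)\varepsilon(t,\theta)\right] \\
&= V_{t-1}'(\theta, Z^{t-1}) + \gamma(t)\left[-\psi(t,\theta)\varepsilon(t,\theta) - V_{t-1}'(\theta, Z^{t-1})\right]
\end{aligned} \tag{11.34}$$

just as in (11.15).


For the prediction-error approach, we developed the general search algorithm
(10.40):


$$\hat\theta_t^{(i)} = \hat\theta_t^{(i-1)} - \mu_t^{(i)}\left[R_t^{(i)}\right]^{-1}V_t'(\hat\theta_t^{(i-1)}, Z^t) \tag{11.35}$$

Here the subscript $t$ denotes that the estimate is based on $t$ data (i.e., $Z^t$). The
superscript $(i)$ denotes the $i$th iteration of the minimization procedure.


Suppose now that, for each iteration $i$, we also collect one more data point.
This would give an algorithm


$$\hat\theta_t^{(t)} = \hat\theta_{t-1}^{(t-1)} - \mu_t^{(t)}\left[R_t^{(t)}\right]^{-1}V_t'(\hat\theta_{t-1}^{(t-1)}, Z^t) \tag{11.36}$$

For easier notation, we introduce

$$\hat\theta(t) = \hat\theta_t^{(t)}, \qquad R(t) = R_t^{(t)} \tag{11.37}$$


We now make the induction assumption that $\hat\theta(t-1)$ actually minimized $V_{t-1}(\theta, Z^{t-1})$
so that


$$V_{t-1}'(\hat\theta(t-1), Z^{t-1}) = 0 \tag{11.38}$$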
(this will of course be an approximation). Then, we have, from (11.34),


$$V_t'(\hat\theta(t-1), Z^t) = -\gamma(t)\psi(t, \hat\theta(t-1))\varepsilon(t, \hat\theta(t-1)) \tag{11.39}$$


With this approximation [and taking $\mu(t) = 1$], we thus arrive at the algorithm


$$\hat\theta(t) = \hat\theta(t-1) + \gamma(t)R^{-1}(t)\psi(t, \hat\theta(t-1))\varepsilon(t, \hat\theta(t-1)) \tag{11.40}$$


We shall discuss the choice of $R(t)$ shortly, but our main concern now is with the
variables $\psi(t, \hat\theta(t-1))$ and $\varepsilon(t, \hat\theta(t-1))$. These are derived from the prediction
$\hat y(t|\hat\theta(t-1))$. In general, the computation of $\hat y(t|\theta)$ for any given value of $\theta$
requires the knowledge of all the data $Z^{t-1}$. For finite-dimensional linear models,
this means that $\hat y(t|\theta)$ is obtained as the output of a linear filter whose coefficients
depend on $\theta$. See (10.56) and (10.60) for a conceptual expression. This means that
$\psi(t, \hat\theta(t-1))$ and $\hat y(t|\hat\theta(t-1))$ cannot be computed “recursively” (i.e., with
fixed-size memory). Instead we have to use some approximation of these variables. The
following approach is natural:


In the time recursions defining $\psi(t,\theta)$ and $\hat y(t|\theta)$ from $Z^t$ for any given $\theta$,
replace, at time $k$, the parameter $\theta$ by the currently available estimate $\hat\theta(k)$.
Denote the resulting approximation of $\psi(t, \hat\theta(t-1))$ and $\hat y(t|\hat\theta(t-1))$
by $\psi(t)$ and $\hat y(t)$. (11.41)


For a finite-dimensional, linear, and time-invariant model (10.60), the approximation
(11.41) takes the form


$$\xi(t+1) = A(\hat\theta(t))\xi(t) + B(\hat\theta(t))z(t) \tag{11.42a}$$

$$\begin{bmatrix}\hat y(t) \\ \psi(t)\end{bmatrix} = C(\hat\theta(t-1))\xi(t) \tag{11.42b}$$


For the Gauss-Newton choice (10.45) and (10.46) of $R_N^{(i)}$, the rule (11.41) suggests
the following approximation:


$$R(t) = \gamma(t)\sum_{k=1}^{t}\beta(t,k)\psi(k)\psi^T(k) \tag{11.43}$$


Using (11.41) and (11.43) in (11.40) now gives the recursive scheme


$$\varepsilon(t) = y(t) - \hat y(t) \tag{11.44a}$$

$$\hat\theta(t) = \hat\theta(t-1) + \gamma(t)R^{-1}(t)\psi(t)\varepsilon(t) \tag{11.44b}$$

$$R(t) = R(t-1) + \gamma(t)\left[\psi(t)\psi^T(t) - R(t-1)\right] \tag{11.44c}$$

The resulting scheme (11.44) together with (11.42) is a recursive Gauss-Newton
prediction-error algorithm.


Family of Recursive Prediction-error Methods


Depending on the underlying model structure, as well as on the choice of $R(t)$,
the scheme (11.44b) corresponds to specific algorithms in a wide family of methods,
which we shall call recursive prediction-error methods (RPEM). For example, the
linear regression


$$\hat y(t|\theta) = \varphi^T(t)\theta$$


gives $\psi(t,\theta) = \psi(t) = \varphi(t)$, and (11.44) is indeed the recursive least-squares
method (11.16). A gradient variant [$R(t) = I$] applied to the same structure gives


$$\hat\theta(t) = \hat\theta(t-1) + \gamma(t)\varphi(t)\varepsilon(t) \tag{11.45}$$


where the gain $\gamma(t)$ could be a given sequence or normalized as


$$\gamma(t) = \frac{\gamma'(t)}{|\varphi(t)|^2} \tag{11.46}$$

This scheme has been widely used, in particular for various adaptive signal-processing
problems, under the name LMS (least mean squares) by Widrow and co-workers.
See Widrow and Stearns (1985). For ARMAX models, see Example 11.1.
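
The gradient scheme (11.45) with the normalized gain (11.46) needs only vector operations. The following sketch is illustrative: the constant `mu` plays the role of $\gamma'(t)$, the small `delta` guarding against division by zero is an added assumption not present in (11.46), and the rotating regressors and `theta0` are made up for the example.

```python
import math


def nlms(regressors, outputs, mu=0.5, delta=1e-12):
    """Normalized LMS: theta(t) = theta(t-1) + gamma(t) * phi(t) * eps(t)."""
    theta = [0.0] * len(regressors[0])
    for phi, y in zip(regressors, outputs):
        eps = y - sum(p * th for p, th in zip(phi, theta))     # prediction error
        gamma = mu / (delta + sum(p * p for p in phi))         # (11.46) with gamma'(t) = mu
        theta = [th + gamma * p * eps for th, p in zip(theta, phi)]   # (11.45)
    return theta


theta0 = [0.5, -1.5]                                           # hypothetical true parameters
regressors = [[math.cos(0.7 * t), math.sin(0.7 * t)] for t in range(200)]
outputs = [phi[0] * theta0[0] + phi[1] * theta0[1] for phi in regressors]
theta_lms = nlms(regressors, outputs)
```

Each update costs a number of operations proportional to the number of parameters, since no matrix is stored or updated; the price is typically slower convergence than with the Gauss-Newton direction.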


Example 11.1 Recursive Maximum Likelihood


Consider the ARMAX model (4.15). Introduce $\varphi(t,\theta)$ as in (4.20). Then
$$\hat y(t|\theta) = \varphi^T(t,\theta)\theta; \qquad \varepsilon(t,\theta) = y(t) - \hat y(t|\theta)$$
$$\psi(t,\theta) + c_1\psi(t-1,\theta) + \cdots + c_{n_c}\psi(t-n_c,\theta) = \varphi(t,\theta)$$
[see (10.52)]. The rule (11.41) then gives the following approximations:
$$\begin{aligned}
\bar\varepsilon(t) &= y(t) - \varphi^T(t)\hat\theta(t) \\
\varphi(t) &= [-y(t-1) \ldots -y(t-n_a) \;\; u(t-1) \ldots u(t-n_b) \;\; \bar\varepsilon(t-1) \ldots \bar\varepsilon(t-n_c)]^T
\end{aligned} \tag{11.47}$$

$$\begin{aligned}
\hat y(t) &= \varphi^T(t)\hat\theta(t-1); \qquad \varepsilon(t) = y(t) - \hat y(t) \\
\psi(t) &+ \hat c_1(t-1)\psi(t-1) + \cdots + \hat c_{n_c}(t-1)\psi(t-n_c) = \varphi(t)
\end{aligned} \tag{11.48}$$
and the algorithm becomes

$$\hat\theta(t) = \hat\theta(t-1) + \gamma(t)R^{-1}(t)\psi(t)\varepsilon(t) \tag{11.49}$$


This scheme is known as recursive maximum likelihood (RML).
Notice the difference between (the “prediction error”) $\varepsilon(t)$ and (the “residual”)
$\bar\varepsilon(t)$. The latter enters the vector $\varphi(t+1)$ and is not required until after $\hat\theta(t)$
is computed. Hence it is natural to distinguish between the quantities as indicated
(cf. Section 5.11, Ljung and Söderström, 1983).
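
For a first-order ARMAX model, the steps of Example 11.1 can be sketched as follows. This is an illustrative sketch, not the book's implementation: it assumes $n_a = n_b = n_c = 1$, propagates $P(t) = \gamma(t)R^{-1}(t)$ as in (11.12) instead of $R(t)$, keeps $|\hat c| < 1$ by the simple projection described in the text as (11.50), and uses made-up names (`simulate_armax1`, `rml_armax1`) and parameter values.

```python
import numpy as np


def simulate_armax1(a, b, c, u, e):
    """Simulate y(t) + a*y(t-1) = b*u(t-1) + e(t) + c*e(t-1)."""
    y = np.zeros(len(u))
    for t in range(1, len(u)):
        y[t] = -a * y[t - 1] + b * u[t - 1] + e[t] + c * e[t - 1]
    return y


def rml_armax1(y, u, lam=1.0, p0=1e3):
    theta = np.zeros(3)                  # estimates of [a, b, c]
    P = p0 * np.eye(3)                   # P(t) = gamma(t) R^{-1}(t)
    psi_prev = np.zeros(3)
    resid_prev = 0.0                     # residual of (11.47) at time t-1
    for t in range(1, len(y)):
        phi = np.array([-y[t - 1], u[t - 1], resid_prev])
        eps = y[t] - phi @ theta                     # prediction error
        psi = phi - theta[2] * psi_prev              # gradient filter (11.48)
        Ppsi = P @ psi
        denom = lam + psi @ Ppsi
        candidate = theta + Ppsi * eps / denom       # (11.49)
        if abs(candidate[2]) < 1.0:                  # projection: keep C(q) stable
            theta = candidate
        P = (P - np.outer(Ppsi, Ppsi) / denom) / lam
        resid_prev = y[t] - phi @ theta              # residual (11.47), uses the new estimate
        psi_prev = psi
    return theta


rng = np.random.default_rng(1)
N = 2000
u = rng.standard_normal(N)
a0, b0, c0 = -0.7, 1.0, 0.5              # hypothetical true parameters
theta_noise_free = rml_armax1(simulate_armax1(a0, b0, c0, u, np.zeros(N)), u)
theta_noisy = rml_armax1(simulate_armax1(a0, b0, c0, u, 0.5 * rng.standard_normal(N)), u)
```

In the noise-free run only $a$ and $b$ can be recovered, since the noise polynomial leaves no trace in the data; the noisy run is the situation the method is meant for. Replacing `psi` by `phi` in the update turns this sketch into the pseudolinear variant discussed in Section 11.5.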


Similarly, the Gauss-Newton RPEM applied to the ARARX structure (4.22)
gives another recursive ML method derived by Gertler and Banyasz (1974), while the
same algorithm with an enforced block diagonal structure of $R(t)$ is called recursive
generalized least squares (RGLS) and was introduced by Hasting-James and Sage
(1969). See also Table 11.1 in Section 11.5.

Applied to state-space models, the RPEM is closely related to the well-known
extended Kalman filter (EKF), as pointed out in Ljung (1979a). The algorithm
(11.44b) thus contains a rich collection of specific, “named,” methods as special cases.
One of its main advantages is also its general applicability. The only requirement on
the model structure is the computability of the gradient $\psi$.


Projection into $D_{\mathcal M}$


The model structure is well defined only for $\theta \in D_{\mathcal M}$, giving stable predictors
[corresponding to the set of $\theta$, for which the matrix $A$ in (11.42) is stable]. In off-line
minimization of the criterion function, this must be kept in mind as a constraint. The
same is true for the recursive minimization (11.44). The simplest way of handling
this problem is to project the estimates into $D_{\mathcal M}$, for example, by


$$\begin{aligned}
\hat\theta'(t) &= \hat\theta(t-1) + \gamma(t)R^{-1}(t)\psi(t)\varepsilon(t) \\
\hat\theta(t) &= \begin{cases} \hat\theta'(t) & \text{if } \hat\theta'(t) \in D_{\mathcal M} \\ \hat\theta(t-1) & \text{if } \hat\theta'(t) \notin D_{\mathcal M} \end{cases}
\end{aligned} \tag{11.50}$$


The extra computational burden involved in (11.50) is the stability test of
whether $\hat\theta'(t) \in D_{\mathcal M}$. It turns out that for successful operation of (11.44) a test
of the kind (11.50) is necessary. However, experience also shows that the projection
typically takes place at only a few samples in the beginning of the data record. The
information loss by ignoring certain samples, as in (11.50), is therefore moderate.
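
One way to carry out the stability test in (11.50) without computing polynomial roots is the step-down (Schur-Cohn type) recursion, sketched below. The sketch assumes that, as for ARMAX models, membership in $D_{\mathcal M}$ amounts to the polynomial $C(q) = 1 + c_1 q^{-1} + \cdots + c_n q^{-n}$ having all zeros strictly inside the unit circle; the helper names `is_stable` and `project` and the argument `c_slice` are made up for the example.

```python
def is_stable(c):
    """True if 1 + c[0] q^-1 + ... + c[n-1] q^-n has all zeros strictly inside the unit circle.

    Step-down recursion: the polynomial is stable exactly when every
    reflection coefficient has magnitude less than one.
    """
    a = list(c)
    while a:
        k = a[-1]                        # reflection coefficient of the current order
        if abs(k) >= 1.0:
            return False
        n = len(a) - 1                   # number of coefficients of the reduced polynomial
        a = [(a[i] - k * a[n - 1 - i]) / (1.0 - k * k) for i in range(n)]
    return True


def project(theta_prev, theta_candidate, c_slice):
    """Projection in the spirit of (11.50): accept the candidate only if its C-polynomial is stable."""
    if is_stable(theta_candidate[c_slice]):
        return theta_candidate
    return theta_prev


kept = project([0.0, 0.0, 0.2], [0.1, 0.1, 1.3], slice(2, 3))       # candidate rejected
accepted = project([0.0, 0.0, 0.2], [0.1, 0.1, -0.4], slice(2, 3))  # candidate accepted
```

The test costs a number of operations proportional to $n^2$ for a polynomial of degree $n$, which is a fixed and known amount, as the format (11.2) requires.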


Asymptotic Properties


The recursive prediction-error method (11.44) is designed to make updates of $\hat\theta$ in a
direction that “on the average” is a modified negative gradient of


$$\bar V(\theta) = \tfrac{1}{2}E\varepsilon^2(t,\theta)$$

i.e.,

$$\frac{d}{d\theta}\bar V(\theta) = -E\psi(t,\theta)\varepsilon(t,\theta)$$

It is thus reasonable to expect that $\hat\theta(t)$ would converge to a local minimum of $\bar V(\theta)$.
This is in fact the case (Ljung, 1981) under certain regularity conditions. Moreover,
for a Gauss-Newton RPEM, with $\gamma(t) = 1/t$, it can be shown that $\hat\theta(t)$ has an
asymptotically normal distribution, which coincides with that of the corresponding
off-line estimate [see (9.17)]. We thus have


• If $R(t) \ge \delta I$, $\delta > 0$, and $\gamma(t) \to 0$ as $t \to \infty$, then, w.p. 1, $\hat\theta(t)$ converges to
a local minimum of $\bar V(\theta) = \tfrac{1}{2}E\varepsilon^2(t,\theta)$. [Measures to ensure $R(t) \ge \delta I$ are
called regularization and are discussed in Section 11.7.]


• Suppose that $\mathcal S \in \mathcal M$ [see (8.10)] and that $\hat\theta(t)$ converges to the true parameter
$\theta_0$. Suppose that the Gauss-Newton RPEM (11.44b and c) is used with $\gamma(t) = 1/t$. Then


$$\begin{aligned}
\sqrt{t}\,(\hat\theta(t) - \theta_0) &\in \mathrm{AsN}(0, P_0) \\
P_0 &= \lambda_0\left[E\psi(t,\theta_0)\psi^T(t,\theta_0)\right]^{-1}
\end{aligned} \tag{11.51}$$


See Appendix 11A for techniques of proof and more insights and results.


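General Norms, Multivariable Case (*)
Starting with a general criterion


$$V_t(\theta, Z^t) = \gamma(t)\sum_{k=1}^{t}\beta(t,k)\,\ell(\varepsilon(k,\theta), k)$$


where $\dim \varepsilon = \dim y = p$, leads to a Gauss-Newton RPEM

$$\begin{aligned}
\hat\theta(t) &= \hat\theta(t-1) + \gamma(t)R^{-1}(t)\psi(t)\,\ell'_\varepsilon(\varepsilon(t), t) \\
R(t) &= R(t-1) + \gamma(t)\left[\psi(t)\,\ell''_{\varepsilon\varepsilon}(\varepsilon(t), t)\,\psi^T(t) - R(t-1)\right]
\end{aligned} \tag{11.52}$$


Here $\ell'_\varepsilon$ is a $p \times 1$ column vector, $\psi(t)$ is a $d \times p$ matrix, and $\ell''_{\varepsilon\varepsilon}$ is a $p \times p$ matrix.
The case with explicit $\theta$-dependence in $\ell$ is analogous.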


11.5 RECURSIVE PSEUDOLINEAR REGRESSIONS
Consider the pseudolinear representation of the prediction (7.112):


$$\hat y(t|\theta) = \varphi^T(t,\theta)\theta \tag{11.53}$$


and recall that this model structure contains, among other models, the general linear
SISO model (4.33). A bootstrap method for estimating $\theta$ in (11.53) was given by
(10.64):


$$\hat\theta_t^{(i)} = \hat\theta_t^{(i-1)} + \mu_t^{(i)}\left[R_t^{(i)}\right]^{-1} f_t(\hat\theta_t^{(i-1)}, Z^t) \tag{11.54a}$$

$$R_t^{(i)} = \gamma(t)\sum_{k=1}^{t}\beta(t,k)\varphi(k,\hat\theta_t^{(i-1)})\varphi^T(k,\hat\theta_t^{(i-1)}) \tag{11.54b}$$

$$f_t(\theta, Z^t) = \gamma(t)\sum_{k=1}^{t}\beta(t,k)\varphi(k,\theta)\varepsilon(k,\theta) \tag{11.55}$$

Here we replaced the equally weighted sums in (10.64) by generally weighted ones,
analogous to (11.33).


With the same approach as for the recursive prediction-error method (making
one new iteration at the same time a new measurement is brought in, and assuming
the previous estimate was a solution [$f_{t-1}(\hat\theta(t-1), Z^{t-1}) = 0$]), we obtain from
(11.54)


$$\hat\theta(t) = \hat\theta(t-1) + \gamma(t)R^{-1}(t)\varphi(t, \hat\theta(t-1))\varepsilon(t, \hat\theta(t-1)) \tag{11.56a}$$

$$R(t) = R(t-1) + \gamma(t)\left[\varphi(t, \hat\theta(t-1))\varphi^T(t, \hat\theta(t-1)) - R(t-1)\right] \tag{11.56b}$$


This algorithm suffers from the same problem as (11.40): The computations of
$\varphi(t, \hat\theta(t-1))$ and $\varepsilon(t, \hat\theta(t-1))$ cannot usually be performed recursively. This
problem can, however, be solved in the same way as for RPEMs, see (11.41). We thus
form as an approximation of $\varphi(t, \hat\theta(t-1))$ a vector $\varphi(t)$ in which all $\theta$-dependent
entries are replaced by recursively computed quantities, analogous to (11.47). We
then have the recursive pseudolinear regression (RPLR):


$$\begin{aligned}
\hat y(t) &= \varphi^T(t)\hat\theta(t-1) \\
\varepsilon(t) &= y(t) - \hat y(t) \\
\hat\theta(t) &= \hat\theta(t-1) + \gamma(t)R^{-1}(t)\varphi(t)\varepsilon(t) \\
R(t) &= R(t-1) + \gamma(t)\left[\varphi(t)\varphi^T(t) - R(t-1)\right]
\end{aligned} \tag{11.57}$$


This algorithm looks exactly like the RLS algorithm (11.16). The same software can
thus be used for RPLR as for RLS. The operational difference lies in the fact that
$\varphi(t)$ in (11.57) contains entries that are constructed from data using past models.
This also affects the convergence properties of the scheme (see the following).
Notice that the only difference between RPLR compared to a RPEM for the
model structure (11.53) is that $\psi$ in (11.44b and c) has been replaced by $\varphi$. For the
general SISO structure (4.33), the relationship between $\psi$ and $\varphi$ is given by (10.54).


Family of RPLRs


The RPLR scheme (11.57) represents a family of well-known algorithms when
applied to different special cases of (11.53). The ARMAX case is perhaps the best
known of these. With $\varphi(t)$ defined by (11.47), the algorithm (11.57) constitutes a
scheme for estimating the parameters of an ARMAX model. This scheme is known
as extended least squares (ELS). Other special cases are displayed in Table 11.1.
TABLE 11.1 Classification of Some Recursive Identification Schemes *


Model Structure   RPEM                                    RPLR
ARX               RLS                                     RLS
ARMAX             RML (Söderström, 1973)                  ELS (Young, 1968; Panuska, 1968)
ARARX             RGLS (Hasting-James and Sage, 1969);    Bethoux, 1976
                  Gertler and Banyasz, 1974
ARARMAX           -                                       EMM (Talmon and van den Boom, 1973)
OE                White, 1975                             Landau, 1976
BJ                Young and Jakeman, 1979                 -


* Compare with Table 4.1.


Asymptotic Properties


Convergence results for the RPLR scheme (11.57) have been given only for some of
the special cases in Table 11.1. Since it differs from RPEM in that $\psi$ is replaced by
$\varphi$, one might guess that the convergence properties will depend on the relationship
between these two vectors.

For ARMAX structures, we have (11.48)


$$\psi(t) = \frac{1}{C(q)}\varphi(t)$$

In fact, it turns out that a sufficient condition for the ELS estimate to converge to
the true parameter values ($\mathcal S \in \mathcal M$) is that

$$\operatorname{Re}\frac{1}{C_0(e^{i\omega})} > \frac{1}{2}, \qquad \forall\omega \tag{11.58}$$


where $C_0(q)$ is the C-polynomial of the true system description. The condition
(11.58) is often expressed as “the filter $(1/C_0(q)) - \tfrac{1}{2}$ is positive real” and can be
seen as a condition that $C_0(q)$ is close to unity (see Problem 11E.4). When RPLR is
applied to the OE structure (4.25) (Landau’s scheme), the corresponding condition
for convergence is


$$\operatorname{Re}\frac{1}{F_0(e^{i\omega})} - \frac{1}{2} > 0, \qquad \forall\omega$$


See Appendix 11A for references and further insights and results.


11.6 THE CHOICE OF UPDATING STEP 


The recursive identification algorithms (11.44) and (11.57) are largely given by their
off-line counterparts. The calculation of the prediction is derived from the
corresponding model structure and the selection of $\psi(t)$ or $\varphi(t)$ has its roots in the choice
between the prediction-error or correlation approaches. What remains is the
quantity $\gamma(t)R^{-1}(t)$ that modifies the update direction and determines the length of
the update step. In this section we shall discuss some aspects of how to determine
$\gamma(t)$ and $R(t)$. [For notational convenience, we give the expressions for the RPEM
(11.44b). RPLR is analogous with $\varphi(t)$ replacing $\psi(t)$.]


Update Direction


There are two basic choices of update directions:


1. The “Gauss-Newton” direction, corresponding to $R(t)$ being an approximation
of the Hessian of the underlying identification criterion:


$$R(t) = R(t-1) + \gamma(t)\left[\psi(t)\psi^T(t) - R(t-1)\right] \tag{11.59}$$


2. The “gradient” direction, corresponding to $R(t)$ being a scaled version of the
identity matrix:


$$R(t) = |\psi(t)|^2 \cdot I \tag{11.60}$$


or


$$R(t) = R(t-1) + \gamma(t)\left[|\psi(t)|^2 \cdot I - R(t-1)\right] \tag{11.61}$$


The choice between the two directions can be characterized as a trade-off between
convergence rate and algorithm complexity. Clearly, the Gauss-Newton direction
requires more computations. Updating $\gamma(t)R^{-1}(t)$ as in (11.12c) (see also Section
11.7) requires a number of operations proportional to $d^2$, which will typically constitute the
dominating part of the computational burden in (11.44). The gradient direction can be
implemented with a number of operations proportional to $d$ per update.


On the other hand, the convergence rate can often be drastically faster with the
Gauss-Newton direction. For the constant-parameter case, analysis shows that this
update direction will yield estimates whose asymptotic distribution has a variance
equal to the Cramér-Rao lower bound [see (11.51)]. This is not true for other update
directions. Notice, though, that this theoretical result holds for the
time-invariant-system case only. When the true system parameters are drifting, it will typically be
better to use another update direction adapted to the parameter drift as in (11.67)
(see Benveniste and Ruget, 1982). An interesting possibility to speed up convergence
by averaging is described in Polyak and Juditsky (1992).


Update Step: Adaptation Gain


An important aspect of recursive algorithms is, as we noted in Section 11.2, their
ability to cope with time-varying systems. There are two different ways of achieving
this:

1. Selecting an appropriate forgetting profile $\beta(t,k)$ in the criterion (11.33) or
selecting a suitable gain $\gamma(t)$ in (11.44) or (11.57). These two approaches are
equivalent in view of the relationships (11.7), (11.13), or (11.14), which may be
summarized as


$$\beta(t,k) = \prod_{j=k+1}^{t}\lambda(j) = \frac{\gamma(k)}{\gamma(t)}\prod_{j=k+1}^{t}\left(1 - \gamma(j)\right) \tag{11.62a}$$

$$\lambda(t) = \frac{\gamma(t-1)}{\gamma(t)}\left[1 - \gamma(t)\right] \tag{11.62b}$$

$$\gamma(t) = \left[1 + \frac{\lambda(t)}{\gamma(t-1)}\right]^{-1} \tag{11.62c}$$


2. Introducing an assumed covariance matrix $R_1(t)$ for the parameter changes
per sample as in (11.29c). This will increase the matrix $P(t)$ and hence the gain
vector $L(t)$.


In either case, the choice of update step or “gain” in the algorithm is a trade-off
between tracking ability and noise sensitivity. A high gain means that the algorithm
is alert in tracking parameter changes but at the same time sensitive to disturbances
in the data, since these are erroneously interpreted as signs of parameter changes.
This trade-off can be discussed more precisely in terms of the quantities $\lambda(t)$, $\gamma(t)$,
and $R_1(t)$.


Choice of Forgetting Factors $\lambda(t)$


The choice of forgetting profile $\beta(t,k)$ is conceptually simple: Select it so that the
criterion essentially contains those measurements that are relevant for the current
properties of the system. For a system that changes gradually and in a “stationary
manner,” the most common choice is to take a constant forgetting factor:


$$\beta(t,k) = \lambda^{t-k}; \qquad \text{i.e., } \lambda(t) \equiv \lambda \tag{11.63}$$

The constant $\lambda$ is always chosen slightly less than 1 so that
$$\beta(t,k) = e^{(t-k)\log\lambda} \approx e^{-(t-k)(1-\lambda)} \tag{11.64}$$
This means that measurements that are older than $T_0 = 1/(1-\lambda)$ samples are
included in the criterion with a weight that is less than $e^{-1} \approx 36\%$ of that of the most recent
measurement. We could call


$$T_0 = \frac{1}{1-\lambda} \tag{11.65a}$$


the memory time constant of the criterion. If the system remains approximately
constant over $T_0$ samples, a suitable choice of $\lambda$ can then be made from (11.65a).
Since the sampling interval typically reflects the natural time constants of the system
dynamics, we could thus select $\lambda$ so that $1/(1-\lambda)$ reflects the ratio between the
time constants of variations in the dynamics and those of the dynamics itself. Typical
choices of $\lambda$ are in the range between 0.98 and 0.995.
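
The relation (11.65a) between the forgetting factor and the memory time constant is easy to tabulate. The helper names in this small sketch are made up for the example; the numbers follow directly from (11.64) and (11.65a).

```python
import math


def memory_time_constant(lam):
    """T0 = 1 / (1 - lambda), cf. (11.65a)."""
    return 1.0 / (1.0 - lam)


def forgetting_factor(T0):
    """Inverse relation: the lambda that gives a memory of T0 samples."""
    return 1.0 - 1.0 / T0


lam = 0.99
T0 = memory_time_constant(lam)          # about 100 samples
weight_at_T0 = lam ** T0                # weight of a measurement that is T0 samples old
approx_weight = math.exp(-1.0)          # the approximation used in (11.64)
```

For $\lambda = 0.99$ the exact weight at age $T_0$ is about 0.366, close to $e^{-1} \approx 0.368$; the typical range 0.98 to 0.995 corresponds to memories of 50 to 200 samples.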

We can also consider the response to a sudden change in the true system. If
the change occurred $k$ samples ago, the relevant (post-change) measurements carry a
fraction $1 - \lambda^k$ of the total weight in the criterion. The response to a step change in the system is thus like that of
a first-order system with time constant (11.65a). For a constant system belonging to
the model set, it follows from Problem 11A.6 that the deviation of the estimate from
the true value behaves like


$$E(\hat\theta(t) - \theta_0)(\hat\theta(t) - \theta_0)^T \sim \frac{1-\lambda}{2}\,\lambda_0\left[E\psi(t,\theta_0)\psi^T(t,\theta_0)\right]^{-1} \tag{11.65b}$$

Here $\lambda_0$ is the true innovations variance. The two expressions (11.65a and b) describe
in formal terms the trade-off in $\lambda$ between tracking alertness and noise sensitivity.
For a system that undergoes abrupt and sudden changes, rather than steady
and slow ones, an adaptive choice of $\lambda$ could be conceived. When an abrupt system
change has been detected, it is suitable to decrease $\lambda(t)$ to a small value for one
sample, thereby “cutting off” past measurements from the criterion, and then to
increase it to a value close to 1 again. Such adaptive choices of $\lambda$ are discussed, for
example, in Fortescue, Kershenbaum, and Ydstie (1981) and Hägglund (1984).


Choice of Gain $\gamma(t)$


The choice of gain can be translated from the corresponding choice of forgetting
factor using (11.62). A constant forgetting factor $\lambda$ gives, after a transient, a constant
gain

$$\gamma = 1 - \lambda$$


Similarly, a sudden decrease at time $t_0$ in $\lambda(t)$ to a small value and then back to 1
corresponds to a sudden increase in $\gamma(t)$ to a value close to 1 [see (11.62)], and then
$\gamma(t) \approx 1/(t - t_0)$.
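
These statements can be checked by iterating the relation $1/\gamma(t) = \lambda(t)/\gamma(t-1) + 1$ from (11.14). In the sketch below the function name `gain_sequence` is made up, and the recursion is started from $1/\gamma(0) = 0$, an assumption that corresponds to having no information before the first sample.

```python
def gain_sequence(lams):
    """Gains gamma(t) generated by the forgetting factors lambda(t)."""
    inv_gamma = 0.0                     # 1/gamma(0) = 0: no prior information (assumption)
    gammas = []
    for lam in lams:
        inv_gamma = lam * inv_gamma + 1.0
        gammas.append(1.0 / inv_gamma)
    return gammas


no_forgetting = gain_sequence([1.0] * 10)             # gamma(t) = 1/t
constant = gain_sequence([0.98] * 2000)               # tends to 1 - 0.98 = 0.02
reset = gain_sequence([1.0] * 50 + [0.0] + [1.0] * 9) # memory cut off at sample 51
```

With $\lambda \equiv 1$ the gain decays like $1/t$; with a constant $\lambda < 1$ it settles at $1 - \lambda$; and setting $\lambda(t)$ to zero for one sample restarts the $1/t$ decay from that sample.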

It is, however, also instructive to discuss the choice of gain in direct terms.
Intuitively, the gain should reflect the relative information contents in the current
observations. An observation with important information (compared to what is
already known) deserves a high gain, and vice versa. This is a useful principle that can
be applied to a variety of situations: For a constant system the relative importance of
a single observation decays like $1/t$. After a substantial change in system dynamics,
the relative information in the observation increases. A measurement with a large
noise component has low information contents, and so on. See also Problem 11E.3.


Including a Model of Parameter Changes 


Analogously to the Kalman filter version (11.29), we could introduce an assumption
that the true parameters vary according to

$$\theta_0(t) = \theta_0(t-1) + w(t) \tag{11.66a}$$
$$E\,w(t)w^T(t) = R_1(t) \tag{11.66b}$$

If we assume that the innovations variance is $R_2(t)$ we get the following version of
the general algorithm (11.44):

$$\begin{aligned}
\hat\theta(t) &= \hat\theta(t-1) + L(t)\varepsilon(t) \\
\varepsilon(t) &= y(t) - \hat y(t) \\
L(t) &= \frac{P(t-1)\psi(t)}{R_2(t) + \psi^T(t)P(t-1)\psi(t)} \\
P(t) &= P(t-1) - \frac{P(t-1)\psi(t)\psi^T(t)P(t-1)}{R_2(t) + \psi^T(t)P(t-1)\psi(t)} + R_1(t)
\end{aligned} \tag{11.67}$$


In the case of a linear regression model, this algorithm does give the optimal
trade-off between tracking ability and noise sensitivity, in terms of a minimal a posteriori
parameter error covariance matrix. (This follows from the original derivation
of the Kalman filter, Kalman and Bucy, 1961, as pointed out in Bohlin, 1970, and
Åström and Wittenmark, 1971.) However, for other models the algorithm (11.67) is
somewhat ad hoc. See Problem 11T.2 for a heuristic derivation of it. Nevertheless,
it is a very useful alternative, in particular if we have some insight into how the
parameters might vary (e.g., if certain parameters vary more rapidly than others). A
fringe benefit of the algorithm is that $P(t)$ is an estimate of the variance of the
parameter error, also taking into account the variation of the true system. For a linear
regression with normal disturbances and normal drift, $P(t)$ is exactly the covariance
matrix of the posterior distribution of $\theta_0(t)$, the mean being $\hat\theta(t)$. See (11.29).

The case where the parameters are subject to variations that themselves are of
a nonstationary nature [i.e., $R_1(t)$ in (11.66) varies substantially with $t$] can be dealt
with in a parallel algorithm structure, as described in Andersson (1985).
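
As an illustration of how the drift model enters the recursion, here is a minimal sketch of one step of (11.67) for a scalar-output linear regression. It assumes that the prediction is $\hat y(t) = \psi^T(t)\hat\theta(t-1)$; the function name `kf_tracking_step` and the argument names `R1`, `R2` are chosen here for illustration and are not taken from the text.

```python
import numpy as np


def kf_tracking_step(theta, P, psi, y, R1, R2):
    """One step of the Kalman-filter-type update (11.67), scalar output.

    theta : (d,) previous estimate       P  : (d, d) previous P(t-1)
    psi   : (d,) regression vector       y  : scalar observation
    R1    : (d, d) assumed drift covariance
    R2    : assumed innovations variance (scalar)
    """
    eps = y - psi @ theta                    # prediction error, yhat = psi^T theta
    denom = R2 + psi @ P @ psi               # scalar R2 + psi^T P psi
    L = P @ psi / denom                      # gain vector L(t)
    theta_new = theta + L * eps
    # Measurement update of P followed by the additive drift term R1.
    P_new = P - np.outer(P @ psi, psi @ P) / denom + R1
    return theta_new, P_new


# Tiny usage example with d = 1.
theta, P = kf_tracking_step(np.array([0.0]), np.array([[1.0]]),
                            np.array([1.0]), 2.0,
                            R1=np.array([[0.1]]), R2=1.0)
print(theta, P)
```

A diagonal `R1` with unequal entries is one way to express that some parameters are expected to drift faster than others.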


Constant Systems 


For a time-invariant system, the natural forgetting profile is $\lambda(t) \equiv 1$ or $\gamma(t) = 1/t$.
However, it turns out that for many recursive algorithms (not including RLS) the
transient convergence rate is significantly improved with a forgetting factor that
increases from, say, 0.95 to 1 over the first 500 data or so:

$$\lambda(t) = 1 - 0.05 \cdot (0.98)^t \tag{11.68}$$

The reason apparently is that early information is somewhat misused and should
therefore carry a lower weight in the criterion compared to later measurements,
whose information contents are processed in a better way (the filters corresponding
to (11.42) are then more accurate).
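
The profile (11.68) can be evaluated directly. In the sketch below the start value 0.95 and the rate 0.98 are exposed as parameters with hypothetical names `lam0` and `rate`; the defaults are the values quoted above.

```python
def forgetting_profile(t, lam0=0.95, rate=0.98):
    """Growing forgetting factor 1 - (1 - lam0) * rate**t, cf. (11.68)."""
    return 1.0 - (1.0 - lam0) * rate ** t


# The factor starts at lam0 and approaches 1 as t grows.
for t in (0, 50, 100, 500):
    print(t, forgetting_profile(t))
```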
Asymptotic Behavior in the Time-varying Case


A heuristic analysis of the asymptotic behavior of (11.67) can be carried out as follows.

Suppose that the true parameters vary according to (11.66), and that we consider
just the linear regression case:

$$y(t) = \psi^T(t)\theta_0(t-1) + e(t), \qquad E e^2(t) = \lambda_0$$

Let us also study a simplified algorithm

$$\hat\theta(t) = \hat\theta(t-1) + P\psi(t)\varepsilon(t) = \hat\theta(t-1) + P\psi(t)\left(y(t) - \psi^T(t)\hat\theta(t-1)\right) \tag{11.69}$$

for some constant “small” matrix $P$. For the LMS-case (11.45) with constant gain,
this corresponds to $P = \gamma I$, while the forgetting factor case, (11.19) or (11.59), with
a constant forgetting factor $\lambda = 1 - \gamma$ corresponds to

$$P(t) = \left(\sum_{k=1}^{t}\lambda^{t-k}\psi(k)\psi^T(k)\right)^{-1} \approx (1-\lambda)S^{-1} = P, \qquad S = \bar E\psi(t)\psi^T(t)$$

where the approximation holds for $\lambda$ close to 1 and large $t$, so that we can average
over the many non-vanishing terms. Equation (11.69) gives

$$\begin{aligned}
\tilde\theta(t) &= \theta_0(t) - \hat\theta(t) \\
\tilde\theta(t) &= \left(I - P\psi(t)\psi^T(t)\right)\tilde\theta(t-1) - P\psi(t)e(t) + w(t)
\end{aligned}$$

If we square both sides and take expectation, disregarding any correlation between
$\tilde\theta$ and $\psi$ and assuming stationarity, we obtain

$$\Pi = E\tilde\theta(t)\tilde\theta^T(t) = \Pi - PS\Pi - \Pi SP + PS\Pi SP + PSP\lambda_0 + R_1, \quad\text{or}$$

$$PS\Pi + \Pi SP = PSP\lambda_0 + R_1$$
where in the last step we ignored the term $PS\Pi SP$ (since $P$ and $\Pi$ are “small”). If
we insert the RLS-choice $P = (1 - \lambda)S^{-1}$ we obtain from this the parameter error
variance

$$\Pi = \frac{1}{2}\left[(1-\lambda)\lambda_0 S^{-1} + \frac{1}{1-\lambda}R_1\right] = \frac{1}{2}\left[\gamma\lambda_0 S^{-1} + \frac{1}{\gamma}R_1\right] \tag{11.70}$$

which is a generalization of (11.65b) to the time-varying case. Similarly, the LMS-choice
gives that the parameter error is the solution $\Pi$ to

$$S\Pi + \Pi S = \gamma\lambda_0 S + \frac{1}{\gamma}R_1 \tag{11.71}$$

These expressions clearly show how the choice of gain $\gamma = 1 - \lambda$ is a trade-off
between the tracking ability $R_1/\gamma$, which favors large $\gamma$, and the noise sensitivity
$\gamma\lambda_0$, which favors small $\gamma$.

In Guo and Ljung (1995) it is formally verified that $\Pi$ indeed is the actual
covariance matrix of the parameter errors, up to terms that tend to zero in a well-defined
way as $\gamma$ tends to zero. In that paper the general algorithm (11.67) is treated
in the linear regression case. See also Ljung and Gunnarsson (1990), which describes
the extension to non-linear parameterizations.
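
To make the trade-off concrete, consider the scalar special case of the RLS expression, in which the error variance is $\Pi(\gamma) = \tfrac12(\gamma\lambda_0/S + R_1/\gamma)$ with $\gamma = 1-\lambda$. Setting the derivative with respect to $\gamma$ to zero gives $\gamma^* = \sqrt{R_1 S/\lambda_0}$. The sketch below is an illustration under that scalar assumption only; the function names are hypothetical.

```python
import math


def rls_tracking_error(gamma, lam0, S, R1):
    """Scalar RLS error variance: noise term gamma*lam0/S plus tracking term R1/gamma."""
    return 0.5 * (gamma * lam0 / S + R1 / gamma)


def best_gain(lam0, S, R1):
    """Gain that minimizes the scalar expression above: sqrt(R1 * S / lam0)."""
    return math.sqrt(R1 * S / lam0)


lam0, S, R1 = 1.0, 1.0, 1e-4
g_star = best_gain(lam0, S, R1)
print(g_star, rls_tracking_error(g_star, lam0, S, R1))
# Halving or doubling the gain increases the error variance.
print(rls_tracking_error(0.5 * g_star, lam0, S, R1),
      rls_tracking_error(2.0 * g_star, lam0, S, R1))
```

With these illustrative numbers the minimizing gain corresponds to a forgetting factor of about 0.99, inside the typical range quoted earlier in this section.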


11.7 IMPLEMENTATION 


The basic, general Gauss-Newton algorithm was given in the form (11.44) or (11.57).
It is clearly not suited for direct implementation as it stands, since a $d \times d$ matrix $R(t)$
would have to be inverted at each time step. In this section we shall discuss some
aspects on how to best implement recursive algorithms. A more thorough discussion
is given in Chapter 6 of Ljung and Söderström (1983).


Using the Matrix Inversion Lemma 


By applying the matrix inversion lemma (11.10) to (11.44), we obtain, analogously
to (11.11), the algorithm (for the case of vector outputs)

$$\hat\theta(t) = \hat\theta(t-1) + L(t)\varepsilon(t) \tag{11.72a}$$
$$\varepsilon(t) = y(t) - \hat y(t) \tag{11.72b}$$
$$L(t) = P(t-1)\eta(t)\left[\lambda(t)\Lambda_t + \eta^T(t)P(t-1)\eta(t)\right]^{-1} \tag{11.72c}$$
$$P(t) = \frac{P(t-1) - P(t-1)\eta(t)\left[\lambda(t)\Lambda_t + \eta^T(t)P(t-1)\eta(t)\right]^{-1}\eta^T(t)P(t-1)}{\lambda(t)} \tag{11.72d}$$

Here $\eta(t)$ (a $d \times p$ matrix) represents either $\varphi$ or $\psi$ depending on the approach.
In this form the dimension of the matrix to be inverted is only $p \times p$, so the saving
in computations compared to (11.44) is substantial. Unfortunately, the $P$-recursion
(11.72d) (which in fact is a Riccati equation) is not numerically sound: the equation
is sensitive to round-off errors that can accumulate and make $P(t)$ indefinite, since
$P(t)$ is essentially computed by successive subtractions.
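
For a scalar output ($p = 1$, weighting taken as 1) the recursion (11.72) needs no matrix inversion at all, only a scalar division. The following sketch illustrates that special case; it assumes a linear regression with $\hat y(t) = \eta^T(t)\hat\theta(t-1)$, and the simulated data and the large initial value of $P$ are illustrative choices, not prescriptions from the text.

```python
import numpy as np


def rls_step(theta, P, eta, y, lam=1.0):
    """One step of (11.72) for scalar output, with Lambda_t taken as 1."""
    eps = y - eta @ theta                  # (11.72b) with yhat = eta^T theta
    denom = lam + eta @ P @ eta            # scalar, so no matrix inversion
    L = P @ eta / denom                    # (11.72c)
    theta = theta + L * eps                # (11.72a)
    P = (P - np.outer(L, eta @ P)) / lam   # (11.72d), computed by subtraction
    return theta, P


# Illustrative data: a two-parameter regression with small noise.
rng = np.random.default_rng(0)
theta_true = np.array([0.5, -0.3])
Phi = rng.normal(size=(200, 2))
y = Phi @ theta_true + 0.1 * rng.normal(size=200)

theta, P = np.zeros(2), 1e6 * np.eye(2)
for eta_t, y_t in zip(Phi, y):
    theta, P = rls_step(theta, P, eta_t, y_t)
print(theta)
```

With `lam=1.0` and a large initial `P` the result should agree closely with the batch least-squares estimate, which makes a convenient sanity check for an implementation.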


Using Factorization 


As we discussed in Section 10.1, it is useful to represent the data matrices in factorized
form [see (10.11)] so as to work with better-conditioned matrices. For recursive
identification, this means that we represent $P(t)$ as a product of matrices and rearrange
(11.72d) so as to update these matrices instead of $P$ itself. Useful representations
are

$$P(t) = Q(t)Q^T(t) \tag{11.73}$$

which, for triangular $Q$, is the Cholesky decomposition and the UD-factorization:

$$P(t) = U(t)D(t)U^T(t) \tag{11.74}$$

with $U(t)$ as an upper triangular matrix with all diagonal elements equal to 1 and
$D(t)$ as a diagonal matrix. Potter (1963) has given an algorithm for updating $Q(t)$ in
(11.73) and Bierman (1977) has developed numerically sound algorithms for $U$ and
$D$ in (11.74). Here we shall give some details of a related algorithm, which is directly
based on Householder transformation (see Problem 10T.1). It was given by Morf
and Kailath (1975).

Step 1. At time $t-1$, let $Q(t-1)$ be the lower triangular square root of $P(t-1)$
as in (11.73). Let $\mu(t)$ be a square root of $\lambda(t)\Lambda_t$. Form the $(p + d) \times (p + d)$
matrix

$$\mathcal{L}(t-1) = \begin{bmatrix} \mu(t) & 0 \\ Q^T(t-1)\eta(t) & Q^T(t-1) \end{bmatrix} \tag{11.75}$$

Step 2. Apply an orthogonal $(p + d) \times (p + d)$ transformation $T$ ($T^TT = I$)
to $\mathcal{L}(t-1)$ so that $T\mathcal{L}(t-1)$ becomes an upper triangular matrix. $T$ can, for example,
be found by QR-factorization. Let $\Pi(t)$, $\bar L(t)$, and $\bar Q(t)$ be the $p \times p$, $d \times p$, and
$d \times d$ matrices defined by

$$T\mathcal{L}(t-1) = \begin{bmatrix} \Pi^T(t) & \bar L^T(t) \\ 0 & \bar Q^T(t) \end{bmatrix} \tag{11.76}$$

(Clearly, $\Pi$ and $\bar Q$ are lower triangular.)
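Step 3. Now with $L(t)$ and $P(t)$ as in (11.72c and d), we have

$$\begin{aligned}
L(t) &= \bar L(t)\Pi^{-1}(t) \\
P(t) &= \frac{\bar Q(t)\bar Q^T(t)}{\lambda(t)} \\
\Pi(t)\Pi^T(t) &= \lambda(t)\Lambda_t + \eta^T(t)P(t-1)\eta(t)
\end{aligned} \tag{11.77}$$

Hence

$$Q(t) = \frac{\bar Q(t)}{\sqrt{\lambda(t)}} \tag{11.78}$$

Verification. Multiplying (11.76) by its transpose gives

$$\begin{aligned}
\begin{bmatrix} \Pi(t) & 0 \\ \bar L(t) & \bar Q(t) \end{bmatrix}
\begin{bmatrix} \Pi^T(t) & \bar L^T(t) \\ 0 & \bar Q^T(t) \end{bmatrix}
&= \begin{bmatrix} \Pi(t)\Pi^T(t) & \Pi(t)\bar L^T(t) \\ \bar L(t)\Pi^T(t) & \bar L(t)\bar L^T(t) + \bar Q(t)\bar Q^T(t) \end{bmatrix} \\
&= \mathcal{L}^T(t-1)T^TT\mathcal{L}(t-1) = \mathcal{L}^T(t-1)\mathcal{L}(t-1) \\
&= \begin{bmatrix} \mu^T(t)\mu(t) + \eta^T(t)Q(t-1)Q^T(t-1)\eta(t) & \eta^T(t)Q(t-1)Q^T(t-1) \\ Q(t-1)Q^T(t-1)\eta(t) & Q(t-1)Q^T(t-1) \end{bmatrix}
\end{aligned}$$

Using the facts that $Q(t-1)Q^T(t-1) = P(t-1)$ and $\mu^T(t)\mu(t) = \lambda(t)\Lambda_t$, it
is now immediate to verify the equalities in (11.77) by a comparison with (11.72c
and d).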

There are several advantages with this particular way of performing (11.72c and
d). First, the only essential computation to perform is the triangularization step (or
QR-factorization) (11.76), for which several good numerical procedures exist. This
step gives both the new $Q$ and the gain $L$ after simple additional calculations. Note
that $\Pi(t)$ is a triangular $p \times p$ matrix, so it is a simple matter to invert it. Second,
in the update (11.76) we only deal with square roots of $P$. Hence the condition
number of the matrix $\mathcal{L}(t-1)$ is much better than that of $P$. Third, with the triangular
square root $Q(t)$ it is easy to introduce regularization, that is, measures to ensure
that the eigenvalues of $P$ stay bounded, at the same time as $P$ remains positive
definite.
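
The three steps can be sketched with a standard QR routine. The code below is an illustration under stated assumptions: it builds the array of Step 1 with a lower triangular square root of $P(t-1)$, uses NumPy's QR factorization as the orthogonal transformation, and reads off the gain and the new square root from the triangular result. The function name `sqrt_rls_step` and its argument names are hypothetical, and the weighting matrix defaults to the identity.

```python
import numpy as np


def sqrt_rls_step(Q_prev, eta, lam=1.0, Lambda=None):
    """Square-root update of the gain L(t) and the factor Q(t) via QR.

    Q_prev : (d, d) lower triangular factor, P(t-1) = Q_prev @ Q_prev.T
    eta    : (d, p) gradient/regression matrix
    lam    : forgetting factor lambda(t);  Lambda : (p, p) weighting matrix
    """
    d, p = eta.shape
    if Lambda is None:
        Lambda = np.eye(p)
    # mu is a square root of lam * Lambda in the sense mu.T @ mu = lam * Lambda.
    mu = np.linalg.cholesky(lam * Lambda).T
    # Step 1: form the (p + d) x (p + d) array.
    array = np.vstack([np.hstack([mu, np.zeros((p, d))]),
                       np.hstack([Q_prev.T @ eta, Q_prev.T])])
    # Step 2: orthogonal triangularization; only the triangular factor is needed.
    R = np.linalg.qr(array, mode="r")
    Pi = R[:p, :p].T        # p x p, lower triangular
    L_bar = R[:p, p:].T     # d x p
    Q_bar = R[p:, p:].T     # d x d, lower triangular
    # Step 3: L = L_bar @ inv(Pi), obtained from a linear solve.
    L = np.linalg.solve(Pi.T, L_bar.T).T
    Q = Q_bar / np.sqrt(lam)
    return L, Q


# Compare with the direct formulas for one random step (scalar output).
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
P_prev = A @ A.T + np.eye(3)
eta = rng.normal(size=(3, 1))
lam = 0.98
L, Q = sqrt_rls_step(np.linalg.cholesky(P_prev), eta, lam)
denom = lam + float(eta.T @ P_prev @ eta)
L_direct = P_prev @ eta / denom
P_direct = (P_prev - L_direct @ eta.T @ P_prev) / lam
print(np.allclose(L, L_direct), np.allclose(Q @ Q.T, P_direct))
```

The sign ambiguity of the QR factorization does not matter here, since both the gain and the product of the new factor with its transpose are unaffected by sign changes of individual rows of the triangular factor.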
Lattice Algorithms (*)


In (10.30) and (10.32), we gave a lattice filter scheme for computing predictions that
can be applied to certain model structures (see also Problems 10E.8, 10D.2). This
scheme is not recursive, since the variables $\rho_n$, $\hat e_n(t)$, and $\hat r_n(t)$ rightly should carry
also an index $N$, reflecting their dependence on the whole data record via (10.32). A
tempting approach would be to approximate $\hat e_n^N(t)$ and $\hat r_n^N(t-1)$ in (10.32) by the
past, already computed residuals $\hat e_n^t(t)$ and $\hat r_n^{t-1}(t-1)$. Then $\rho_n$ could be recursively
computed, and we obtain a scheme

$$\begin{aligned}
\hat e_{n+1}(t) &= \hat e_n(t) + \rho_n(t-1)\hat r_n(t-1) \\
\hat r_{n+1}(t) &= \hat r_n(t-1) + \rho_n(t-1)\hat e_n(t) \\
R_n^{er}(t) &= R_n^{er}(t-1) + \hat e_n(t)\hat r_n(t-1) \\
R_n^{e}(t) &= R_n^{e}(t-1) + \hat e_n^2(t) \\
\rho_n(t) &= -\frac{R_n^{er}(t)}{R_n^{e}(t)} \\
\hat e_0(t) &= \hat r_0(t) = y(t)
\end{aligned} \tag{11.79}$$

This algorithm was developed by Griffiths (1977) and Makhoul (1977) and has
been called a gradient lattice algorithm. It involves approximation, as we pointed out,
and consequently does not exactly implement the off-line estimate. It is interesting
to note, though, that by slightly modifying the $R^{er}$ and $R^{e}$ updates an exact version
is obtained. This was proved by Lee, Morf, and Friedlander (1981). We here give
the resulting algorithm for the case of a multivariate signal $z(t)$ in (10.12) (which
includes applications to dynamic systems; see Problem 10E.7). We also include an
arbitrary constant forgetting factor $\lambda$.


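1. Initialize at $t = 0$: Let
$R_n^{e}(0) := \delta I$, $R_n^{r}(-1) := \delta I$, $R_n^{er}(0) := 0$, $\hat r_n(0) := 0$, $n = 0, \ldots, M-1$
2. At time $t - 1$, store

$$R_n^{er}(t-1),\quad R_n^{e}(t-1),\quad R_n^{r}(t-2),\quad \hat r_n(t-1), \qquad n = 0, \ldots, M-1$$

3. At time $t - 1$, compute for $n = 0, \ldots, M-1$:

$$\begin{aligned}
\rho_n(t-1) &= -\left[R_n^{er}(t-1)\right]^T\left[R_n^{r}(t-2)\right]^{-1} \\
\tilde\rho_n(t-1) &= -R_n^{er}(t-1)\left[R_n^{e}(t-1)\right]^{-1}
\end{aligned} \tag{11.80}$$

4. For $n = 0, \ldots, M-1$, update

$$\begin{aligned}
\hat e_0(t) &= \hat r_0(t) = z(t) \\
\hat e_{n+1}(t) &= \hat e_n(t) + \rho_n(t-1)\hat r_n(t-1) \\
\hat r_{n+1}(t) &= \hat r_n(t-1) + \tilde\rho_n(t-1)\hat e_n(t)
\end{aligned}$$

5. Update for $n = 0, \ldots, M-1$:

$$\begin{aligned}
R_n^{r}(t-1) &= \lambda R_n^{r}(t-2) + [1 - \beta_n(t)]\,\hat r_n(t-1)\hat r_n^T(t-1) \\
R_n^{e}(t) &= \lambda R_n^{e}(t-1) + [1 - \beta_n(t)]\,\hat e_n(t)\hat e_n^T(t) \\
R_n^{er}(t) &= \lambda R_n^{er}(t-1) + [1 - \beta_n(t)]\,\hat r_n(t-1)\hat e_n^T(t) \\
\beta_{n+1}(t) &= \beta_n(t) + [1 - \beta_n(t)]^2\,\hat r_n^T(t-1)\left[R_n^{r}(t-1)\right]^{-1}\hat r_n(t-1) \\
\beta_0(t) &= 0
\end{aligned}$$

6. Go to step 2.

The prediction of $z(t)$ based on $z(t-1), \ldots, z(t-n)$ is thus given by

$$\hat z_n(t) = z(t) - \hat e_n(t)$$

It can be computed before $z(t)$ is received by rearranging (11.80b) as

$$\hat z_n(t) = \hat z_{n-1}(t) - \rho_{n-1}(t-1)\hat r_{n-1}(t-1), \qquad \hat z_0(t) = 0$$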

11.8 SUMMARY


Recursive identification algorithms are instrumental for most adaptation schemes.
A recursive algorithm can be derived from an off-line counterpart using the philosophy
of performing one iteration in the numerical search at the same time as a new
observation is included in the criterion. In some special cases (e.g., the recursive least
squares and the recursive instrumental variable cases) this leads to algorithms that
exactly calculate the off-line estimates in a recursive fashion. In general, though, the
recursive constraint means that the data record is not maximally utilized.

With the described philosophy, three basic and common classes of recursive
methods can be distinguished:


1. Recursive prediction-error methods (RPEM), (11.44)
2. Recursive pseudolinear regressions (RPLR), (11.57)
3. Recursive instrumental-variable methods (RIV), (11.32)


The asymptotic properties obtained by a RPEM for a constant system, using
a Gauss-Newton update direction, coincide with those of the corresponding off-line
prediction-error method. This is a very important result, which means that the
discussion and results of Chapters 8 and 9, as well as their consequences for user’s
choices in Part III, apply equally well to RPEMs. Convergence of PLRs is tied to
positive realness of certain transfer functions associated with the true system.

The identification problem includes a variety of design variables to be chosen
by the user. Many of these are common to both off-line and recursive methods.
Recursive algorithms include, in addition, two important quantities that may have a
considerable effect on the quality of the estimates: the update direction and the gain.
Apparently, the Gauss-Newton direction is a very good choice of update direction for
constant systems, even though it may include more calculations than other choices.
As a guiding principle for the choice of gain, it can be said that it should reflect the
relative information contents in the current measurement.

The principles of recursive identification as outlined in this chapter can be applied
to “stochastic” and “deterministic” systems equally well. The distinction, which
is partly semantic, shows up on two occasions: (1) when forming the prediction model
one can apply some probabilistic machinery or simply “guess,” and (2) when selecting
the adaptation gain, the “relative information contents in the current measurement”
can be interpreted as a formal signal-to-noise ratio or in a more intuitive manner.


11.9 BIBLIOGRAPHY 


A thorough treatment of recursive identification following the framework of the
present chapter is given in Ljung and Söderström (1983). The monograph by Young
(1984) gives a comprehensive study of recursive estimation techniques with a focus on
instrumental variable methods, and Solo and Kong (1995) contains a comprehensive
study of the asymptotic behavior. A survey of algorithms for adaptation and tracking
is given in Ljung and Gunnarsson (1990). Widrow and Stearns (1985) focuses on
adaptive signal-processing applications. Adaptive control is treated, for example, in
Åström and Wittenmark (1989), Landau (1979), and Goodwin and Sin (1984).


Section 11.2: The derivation of the RLS algorithm goes back to Gauss (1809), as
discussed in Appendix 2 of Young (1984). A more recent “early” reference is Plackett
(1950). The relationship between RLS and the Kalman filter has been discussed by
Ho (1963), and Bohlin (1970).


Section 11.3: RIV has been extensively discussed in Young (1968) and Young (1984).


Section 11.4: The general RPEM as presented here was derived in Ljung (1981).
The RML (11.49) was derived by Söderström (1973) and Åström (1972). A similar
algorithm was independently suggested by Fuhrt (1973). Moore and Weiss (1979)
suggested a general RPEM, based on a slightly different framework. A comprehensive
bibliography of the ramifications of RPEM is given in Ljung and Söderström (1983).
The asymptotic distribution of the RML algorithm has been discussed also by Hannan
(1980a) and Solo (1981).

Section 11.5: Apparently, the first RPLR algorithm was the ELS scheme developed
by Panuska (1968) and Young (1968). The term PLR is adopted from Solo (1978).
Variants of RPLR have been referred to in the text. The convergence analysis of
such schemes has been developed by Ljung (1977a), Moore and Ledwich (1980), and
Solo (1979). The classification of Table 11.1 goes back to Ljung (1979b).

Section 11.6: This section is based on Chapter 5 in Ljung and Söderström (1983).
Gain adaptation and step-size selection have also been discussed in Bohlin (1976),
Benveniste and Ruget (1982), and Kulhavy and Kraus (1996). Relations to detection
are treated in, e.g., Basseville and Nikiforov (1993) and Gustafsson (1996).


Section 11.7: Chapter 6 of Ljung and Söderström (1983) gives more details of the
implementation. The numerical properties of the schemes have been studied in Ljung
and Ljung (1985), Graupe, Jain, and Salahi (1980), Mueller (1981), and Samson and
Reddy (1983). Algorithms for fast calculation of the vector $L(t)$ in (11.72) have been
derived in Ljung, Morf, and Falconer (1978) and discussed in Carayannis, Manolakis,
and Kalouptsidis (1983), Cioffi and Kailath (1984), and Lin (1984). The literature on
lattice algorithms is extensive. See, for example, Lee, Morf, and Friedlander (1981),
Friedlander (1982), Samson (1982), and the monograph by Honig and Messerschmitt
(1984).


Appendix 11A: In addition to the references in the text, the ODE approach to
the convergence analysis of recursive algorithms has also been discussed in Ljung
(1978b), Ljung (1984), Ljung, Pflug, and Walk (1992), Kushner and Clark (1978),
and Benveniste, Métivier, and Priouret (1990). The link between simple recursive
algorithms and an associated ODE was established by Khasminski (1966) for
nondecreasing gain algorithms. Related techniques to study the asymptotic distribution
are described in Kushner and Huang (1979) and Benveniste and Ruget (1982).

For RPLRs, a martingale-based convergence technique, Moore and Ledwich
(1980), Solo (1979), has been quite successful. See Goodwin and Sin (1984) for
numerous applications of this technique to adaptive control algorithms.


11.10 PROBLEMS 


11E.1 Apply RPEM to a first-order ARMA model

$$y(t) + ay(t-1) = e(t) + ce(t-1)$$

Derive an explicit expression for the difference

$$\varepsilon(t) - \varepsilon(t, \hat\theta(t-1))$$

Discuss when this difference will be small.


11E.2 Suppose that the algorithm (11.12) is used with $\lambda(t) \equiv 1$. Suppose also that $\mathcal{S} \in \mathcal{M}$.
Show that the variance of $\hat\theta(t)$ (neglecting initial-value effects) is then given by $\lambda_0 P(t)$,
where $\lambda_0$ is the innovations variance.

11E.3 Formalize the notion that the gain $\gamma(t)$ should “reflect the relative information
contents in the current observation,” as follows: Decompose, for the RLS method, the
prediction error $\varepsilon(t) = y(t) - \hat\theta^T(t-1)\varphi(t) = e_0(t) + \tilde\theta^T(t-1)\varphi(t)$, where $\tilde\theta(t)$ is
the parameter error $\theta_0 - \hat\theta(t)$. Interpret the scalar $\varphi^T(t)L(t)$ in the algorithm (11.29)
as the signal to signal + noise ratio for the “measured” quantity $\varepsilon(t)$.


11E.4 Prove that

$$\operatorname{Re}\frac{1}{C_0(e^{i\omega})} > \frac{1}{2} \iff \left|1 - C_0(e^{i\omega})\right| < 1$$

11T.1 Convert the Newton-Raphson algorithm (10.49) for the correlation approach into a
recursive algorithm, and compare with the RPLR algorithm (11.57).


11T.2 Consider a general model structure and assume that the true parameter vector is
varying according to (11.66). Suppose that it is known that $\theta_0(t)$ is close to a known
value $\theta_*(t)$ (like, e.g., the previous estimate). With the true innovation $e_0(t)$, we thus
have

$$y(t) = \hat y(t|\theta_0(t)) + e_0(t)$$

Now approximate

$$\hat y(t|\theta_0(t)) \approx \hat y(t|\theta_*(t)) + \psi^T(t,\theta_*(t))\left(\theta_0(t) - \theta_*(t)\right)$$

Introduce the variable

$$y_*(t) = y(t) - \hat y(t|\theta_*(t)) + \psi^T(t,\theta_*(t))\theta_*(t)$$

whose value is known at time $t$, and write the model as

$$\begin{aligned}
\theta_0(t) &= \theta_0(t-1) + w(t) \\
y_*(t) &= \psi^T(t,\theta_*(t))\theta_0(t) + e_0(t)
\end{aligned}$$

With $\psi(t,\theta_*(t)) = \psi(t)$ a known vector, this is now of the linear regression type
(11.25) and (11.28). Apply the optimal algorithm (11.29) to this approximate description
and show that (11.67) is obtained.


11D.1 Let $\beta(t,k)$ be defined by (11.6) and let $\gamma(t)$ be defined by (11.13). Show that

$$\beta(t,k) = \frac{\gamma(k)}{\gamma(t)}\prod_{j=k+1}^{t}\left[1 - \gamma(j)\right]$$

11D.2 Verify the expression (11.19).


APPENDIX 11A: TECHNIQUES FOR ASYMPTOTIC ANALYSIS OF 
RECURSIVE ALGORITHMS 


Methods to prove convergence and analytically assess the quality of recursively computed
estimates tend to be technical. A comprehensive treatment is given in Chapter
4 of Ljung and Söderström (1983), and we provide in this appendix some insights
and guidelines.

Most of the existing results deal with the convergence properties of $\hat\theta(t)$ as $t$
tends to infinity and the gain $\gamma(t)$ tends to zero. There are also some results on the
asymptotic distribution of $\hat\theta(t)$ under the same assumption. Results for the tracking
case, when the true parameter changes and the gain $\gamma(t)$ does not tend to zero, have
been studied by Benveniste and Ruget (1982), Kushner and Huang (1981), Guo and
Ljung (1995), Solo and Kong (1995), and others.
An Algorithm Format


The recursive algorithms can all conceptually be given in the form

$$\hat\theta(t) = \hat\theta(t-1) + \gamma(t)R^{-1}(t)\eta(t)\varepsilon(t) \tag{11A.1a}$$
$$\varepsilon(t) = y(t) - \hat y(t) \tag{11A.1b}$$
$$\xi(t+1) = A(\hat\theta(t))\xi(t) + B(\hat\theta(t))\begin{bmatrix} y(t) \\ u(t) \end{bmatrix} \tag{11A.1c}$$
$$\begin{bmatrix} \hat y(t+1) \\ \eta(t+1) \end{bmatrix} = C(\hat\theta(t))\xi(t+1) \tag{11A.1d}$$

The choice of $R$ could be any positive definite matrix. Most common is, however,
the Gauss-Newton choice

$$R(t) = R(t-1) + \gamma(t)\left[\eta(t)\eta^T(t) - R(t-1)\right] \tag{11A.2}$$

Here $\eta(t)$ corresponds to $\varphi(t)$, $\psi(t)$, or $\zeta(t)$, depending on the particular algorithm
used.

In some special cases the recursive algorithm (11A.1) can be solved to yield
an explicit expression for $\hat\theta(t)$. This happens for the RLS algorithm (11.16) [see
(11.19)] and for the RIV algorithm (11.32) with given instruments [see (11.30)]. In
these cases an analysis can be carried out based on the explicit expressions. This
analysis will then coincide with the off-line analysis of Chapters 8 and 9.

In most cases, however, no explicit expression for $\hat\theta(t)$ can be obtained. Indeed,
$\hat\theta(t)$ will be a fairly complicated function of the data set $Z^t$, partly as a result of the
time-varying, estimate-dependent filtering in (11A.1c,d). See (11.41). This means
that it is a difficult problem to determine the asymptotic properties of the estimate
$\hat\theta(t)$. In this appendix we shall give some insights into the behavior of (11A.1), and
we shall also state some basic results on convergence and the asymptotic distribution
of $\hat\theta(t)$. For a formal analysis, we refer to Ljung and Söderström (1983).


Error Models 


Regarding $\gamma(t)$ and $R^{-1}(t)$ as modifications of the basic update information $\eta(t)\varepsilon(t)$,
it follows that the relationship between the crucial quantities

$$\hat\theta \;\text{--}\; \eta \;\text{--}\; \varepsilon \tag{11A.3}$$

will be a key to the convergence properties of (11A.1). Such a relationship is often
called an error model (in particular in connection with adaptive control applications).

We can write, symbolically,

$$\Delta\hat\theta(t) \sim \eta(t)\varepsilon(t) \tag{11A.4}$$

for (11A.1a). The right side is a random variable. If $\gamma(t)$ is small so that a noticeable
change in $\hat\theta$ is the result of many steps like (11A.4), then this change is likely to occur
in the direction of the expected value:

$$\Delta\hat\theta \sim E\eta\varepsilon \tag{11A.5}$$

All asymptotic analysis of (11A.1) is based on the concept (11A.5), but there are
several ways of formalizing the analysis. Here we shall briefly describe an approach
based on an associated ordinary differential equation (ODE).


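An Associated ODE

Let $\hat y(t|\theta)$ and $\eta(t,\theta)$ be defined by

$$\begin{aligned}
\xi(t+1,\theta) &= A(\theta)\xi(t,\theta) + B(\theta)\begin{bmatrix} y(t) \\ u(t) \end{bmatrix} \\
\begin{bmatrix} \hat y(t|\theta) \\ \eta(t,\theta) \end{bmatrix} &= C(\theta)\xi(t,\theta)
\end{aligned} \tag{11A.6}$$

Define

$$f(\theta) = \bar E\eta(t,\theta)\varepsilon(t,\theta) \tag{11A.7}$$

as the average update direction, associated with the parameter value $\theta$. Then (11A.1)
is associated with the following ODE:

$$\frac{d}{d\tau}\theta_D(\tau) = R^{-1}f(\theta_D(\tau)) \tag{11A.8}$$

If $R(t)$ is determined as in (11A.2), then define

$$G(\theta) = \bar E\eta(t,\theta)\eta^T(t,\theta) \tag{11A.9}$$

and the associated ODE becomes

$$\begin{aligned}
\frac{d}{d\tau}\theta_D(\tau) &= R_D^{-1}(\tau)f(\theta_D(\tau)) \\
\frac{d}{d\tau}R_D(\tau) &= G(\theta_D(\tau)) - R_D(\tau)
\end{aligned} \tag{11A.10}$$

We shall shortly describe in what sense (11A.10) is “associated” with (11A.1) and
(11A.2), but first let us discuss heuristically how it is obtained.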
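Suppose that $R(t)$ in (11A.1a) is fixed to be $R$, that $\hat\theta(t_0) = \bar\theta$, and that $t > t_0$
is chosen so that $\hat\theta(t)$ and $\hat\theta(t_0)$ are close. Then

$$\begin{aligned}
\hat\theta(t) &= \hat\theta(t_0) + R^{-1}\sum_{k=t_0}^{t}\gamma(k)\eta(k)\varepsilon(k)
\approx \bar\theta + R^{-1}\sum_{k=t_0}^{t}\gamma(k)\eta(k,\bar\theta)\varepsilon(k,\bar\theta) \\
&= \bar\theta + R^{-1}\bar E\eta(k,\bar\theta)\varepsilon(k,\bar\theta)\sum_{k=t_0}^{t}\gamma(k)
+ R^{-1}\sum_{k=t_0}^{t}\gamma(k)\left[\eta(k,\bar\theta)\varepsilon(k,\bar\theta) - \bar E\eta(k,\bar\theta)\varepsilon(k,\bar\theta)\right] \\
&\approx \bar\theta + \Delta\tau\,R^{-1}\bar E\eta(k,\bar\theta)\varepsilon(k,\bar\theta)
\end{aligned} \tag{11A.11}$$

where

$$\Delta\tau = \sum_{k=t_0}^{t}\gamma(k) \tag{11A.12}$$

Here the first approximation comes from $\hat\theta(k)$ being close to $\bar\theta$ for $t_0 \le k \le t$ and
thus replacing $\hat\theta(k)$ by $\bar\theta$ in (11A.1c,d), yielding (11A.6). The second approximation
amounts to neglecting the second term, being a sum of zero-mean random variables,
compared to the mean value. This is an application of a law of large numbers. Now,
changing the time scale from $t$ to $\tau$ so that

$$\theta_D(\tau_t) \leftrightarrow \hat\theta(t), \qquad \tau_t = \sum_{k=1}^{t}\gamma(k) \tag{11A.13}$$

we can write (11A.11) as

$$\theta_D(\tau_t) = \theta_D(\tau_{t_0}) + (\tau_t - \tau_{t_0})R^{-1}f(\theta_D(\tau_{t_0})) \tag{11A.14}$$

This expression is Euler’s method for solving the ODE (11A.8), and hence the link
between (11A.1) and (11A.8) is heuristically established. Including (11A.2) leads to
(11A.10) in a similar way.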

The heuristic discussion suggests that the trajectories of (11A.10) describe the
behavior of (11A.1) and (11A.2) if $\gamma$ is small enough. With a certain amount of technical
work, the following links can be formally established (under certain regularity
conditions):


• If $\gamma(t) \to 0$ as $t \to \infty$ and all trajectories of (11A.10) converge to a set
$D_c \subset D_{\mathcal{M}}$, then the estimates $\hat\theta(t)$ converge to $D_c$ with probability 1 as $t \to \infty$
provided they are constrained to $D_{\mathcal{M}}$. (11A.15)

• If $\gamma(t) \to 0$ as $t \to \infty$ and $\hat\theta(t) \to \theta^*$ with positive probability, then $\theta^*$ must
be a stable stationary point of (11A.10). (11A.16)

• If $\hat\theta(t_0) = \bar\theta$ and $\theta_D(\tau)$ denotes the solution to (11A.10) with $\theta_D(\tau_{t_0}) = \bar\theta$,
then

$$P\left(\sup_{t_0 \le t \le T}\left|\hat\theta(t) - \theta_D(\tau_t)\right| > \varepsilon\right) \le K(\varepsilon, p)\sum_{k=t_0}^{T}\gamma^p(k) \tag{11A.17}$$

where $\tau_t$ is defined by (11A.13).

For proofs of these statements, see Ljung (1977b).
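
A very simple instance shows the link between a recursion of the form (11A.1a) and its ODE. The sketch below is an illustration chosen here, not an example from the text: it estimates a constant mean $m$ with $R = 1$, $\eta = 1$ and $\varepsilon(t) = y(t) - \hat\theta(t-1)$, so that the average update direction is $f(\theta) = m - \theta$ and the ODE $d\theta_D/d\tau = m - \theta_D$ has the solution $\theta_D(\tau) = m + (\bar\theta - m)e^{-\tau}$. The gain, noise level, and sample count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
m, theta_bar = 1.0, -1.0      # true mean and initial estimate (illustrative)
gamma, n = 0.01, 400          # small constant gain, number of samples

theta, tau, max_dev = theta_bar, 0.0, 0.0
for t in range(1, n + 1):
    y = m + 0.5 * rng.normal()
    theta += gamma * (y - theta)      # recursion with R = 1, eta = 1
    tau += gamma                      # accumulated gain, cf. the time scale (11A.13)
    theta_ode = m + (theta_bar - m) * np.exp(-tau)   # ODE trajectory at tau
    max_dev = max(max_dev, abs(theta - theta_ode))

print("final estimate:", theta)
print("largest deviation from the ODE trajectory:", max_dev)
```

Reducing `gamma` while keeping the total `tau` fixed makes the recursion follow the ODE trajectory more closely, in line with the bound (11A.17).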


Convergence of RPEM

For the RPEM family, we have

$$\eta(t,\theta) = \psi(t,\theta) = -\left[\frac{d}{d\theta}\varepsilon(t,\theta)\right]^T \tag{11A.18}$$

This defining relationship can indeed be seen as an error model (11A.3) that holds
regardless of the properties of the data.
We also have

$$f(\theta) = \bar E\psi(t,\theta)\varepsilon(t,\theta) = -\left[\frac{d}{d\theta}\bar E\tfrac{1}{2}\varepsilon^2(t,\theta)\right]^T \tag{11A.19}$$

Let

$$\bar V(\theta) = \bar E\tfrac{1}{2}\varepsilon^2(t,\theta) \tag{11A.20}$$

Then along trajectories of (11A.8) (recall that $R > 0$)

$$\frac{d}{d\tau}\bar V(\theta_D(\tau)) = \bar V'(\theta_D(\tau))R^{-1}f(\theta_D(\tau)) = -f^T(\theta_D(\tau))R^{-1}f(\theta_D(\tau)) \tag{11A.21}$$

which shows that $\bar V$ is decreasing outside the set

$$D_c = \{\theta \mid f(\theta) = 0\} \tag{11A.22}$$

$\bar V(\theta)$ is thus a Lyapunov function for (11A.8), showing that its trajectories (among
those that stay in $D_{\mathcal{M}}$) will converge to $D_c$ as $\tau \to \infty$. According to (11A.15), this
implies that

$$\hat\theta(t) \to D_c, \quad \text{w.p. 1 as } t \to \infty \tag{11A.23}$$

(or that $\hat\theta(t)$ tends to the boundary of $D_{\mathcal{M}}$, which cannot be excluded unless the
particular way $\hat\theta(t)$ is projected into $D_{\mathcal{M}}$ [see (11.50)] is specified). In view of (11A.16),
points that are not local minima of $\bar V(\theta)$ can be excluded from $D_c$. We thus find
that the RPEM estimate will converge w.p. 1 to a local minimum of $\bar V(\theta)$, which is
exactly the same result as for the off-line counterpart (10.40). We summarize this as
follows:

Result 11.1. Consider the algorithm (11A.1) with $\eta(t) = \psi(t)$ and with a
positive definite $R(t)$. Assume that $\gamma(t) \to 0$ as $t \to \infty$ and that $\hat\theta(t)$ is constrained
to a subset of $D_{\mathcal{M}}$. Then $\hat\theta(t)$ converges w.p. 1 to a local minimum of $\bar V(\theta)$ (or to
the boundary of $D_{\mathcal{M}}$) as $t \to \infty$.


Asymptotic Distribution of RPE Estimates 


Consider the Gauss-Newton RPEM (11.44), and assume that there exists a $\theta_0$ such
that $\varepsilon(t,\theta_0) = e_0(t)$ is white noise. Let

$$\bar R(t) = \frac{1}{\gamma(t)}R(t) = \sum_{k=1}^{t}\beta(t,k)\psi(k)\psi^T(k) \tag{11A.24}$$

[cf. (11.5b) and (11.43)]. Then, analogous to (11.9b),

$$\bar R(t) = \lambda(t)\bar R(t-1) + \psi(t)\psi^T(t) \tag{11A.25}$$

Introduce $\tilde\theta(t) = \hat\theta(t) - \theta_0$ and rewrite (11.44b) as

$$\begin{aligned}
\bar R(t)\tilde\theta(t) &= \bar R(t)\tilde\theta(t-1) + \psi(t)\varepsilon(t) \\
&= \lambda(t)\bar R(t-1)\tilde\theta(t-1) + \psi(t)\left[\psi^T(t)\tilde\theta(t-1) + \varepsilon(t)\right]
\end{aligned}$$

This expression can be summed from $t = 0$ to $t$, giving

$$\bar R(t)\tilde\theta(t) = \beta(t,0)\bar R(0)\tilde\theta(0) + \sum_{k=1}^{t}\beta(t,k)\psi(k)\left[\psi^T(k)\tilde\theta(k-1) + \varepsilon(k)\right] \tag{11A.26}$$

Consider the sum

$$S_t = \sum_{k=1}^{t}\beta(t,k)\psi(k)\left[\psi^T(k)\tilde\theta(k-1) + \varepsilon(k) - \varepsilon(k,\theta_0)\right]$$

By definition,

$$\varepsilon(k) - \varepsilon(k,\theta_0) \approx \varepsilon(k,\hat\theta(k)) - \varepsilon(k,\theta_0) \approx -\psi^T(k)\left[\hat\theta(k) - \theta_0\right] \tag{11A.27}$$

where the first approximation follows since $\varepsilon(k)$ is filtered using estimates approximately
equal to $\hat\theta(k)$ [see (11A.1c,d)] and the second approximation is the mean-value
theorem. This means that the sum $S_t$ is likely to be negligible compared to
other terms in (11A.26). Also, $\beta(t,0)\bar R(0)\tilde\theta(0)$ should be negligible. Hence

$$\tilde\theta(t) \approx \bar R^{-1}(t)\sum_{k=1}^{t}\beta(t,k)\psi(k)e_0(k) \tag{11A.28}$$

Assuming that $\hat\theta(k)$ is close to $\theta_0$ “asymptotically most of the time,” we can replace
$\psi(k)$ by $\psi(k,\theta_0)$ in (11A.24) and (11A.28) without too much error. This gives

$$\begin{aligned}
\hat\theta(t) - \theta_0 &\approx \bar R^{-1}(t)\sum_{k=1}^{t}\beta(t,k)\psi(k,\theta_0)e_0(k) \\
&= -\left[V_t''(\theta_0, Z^t)\right]^{-1}V_t'(\theta_0, Z^t)
\end{aligned} \tag{11A.29}$$

where

$$V_t(\theta, Z^t) = \frac{1}{2}\sum_{k=1}^{t}\beta(t,k)\varepsilon^2(k,\theta) \tag{11A.30}$$

From Section 9.2 we know that (11A.29) is exactly the same asymptotic expression
as we would get for the off-line estimate

$$\hat\theta_t = \arg\min_{\theta}V_t(\theta, Z^t) \tag{11A.31}$$

This discussion has been heuristic, neglecting terms without formal justification.
However, with some technical labor, the approximations can be verified formally.
This is done in Theorem 4.5 in Ljung and Söderström (1983). The result thus is that

The asymptotic distribution of estimates obtained by a Gauss-Newton
RPEM is the same as for the corresponding off-line estimate. (11A.32)

In particular, for $\gamma(t) = 1/t$, which implies that $\beta(t,k) = 1$ for all $t$ and $k$, we obtain
from Theorem 9.1 and (9.17) the following result:

Result 11.2. Consider the Gauss-Newton RPEM (11.44) with $\gamma(t) = 1/t$.
Assume that there exists a $\theta_0$ such that $\varepsilon(t,\theta_0) = e_0(t)$ is white noise with variance
$\lambda_0$. Then, if $\hat\theta(t) \to \theta_0$,

$$\sqrt{t}\left(\hat\theta(t) - \theta_0\right) \in \mathrm{AsN}\left(0, \lambda_0\left[\bar E\psi(t,\theta_0)\psi^T(t,\theta_0)\right]^{-1}\right) \tag{11A.33}$$


Convergence of RPLRs 


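First we postulate a certain error model (11A.3):

$$\begin{aligned}
&\varepsilon(t,\theta) = \tilde\varphi^T(t,\theta)(\theta_0 - \theta) + e_0(t) \\
&\tilde\varphi(t,\theta) = H_0(q)\varphi(t,\theta) \\
&e_0(t) \text{ independent of } \varphi(s,\theta)
\end{aligned} \tag{11A.34}$$

Let us investigate what the convergence properties of (11A.1) and (11A.2), with
$\eta(t) = \varphi(t)$, might be under the assumption (11A.34). The right side of the associated
ODE (11A.10) becomes

$$\begin{aligned}
f(\theta) &= \bar E\varphi(t,\theta)\tilde\varphi^T(t,\theta)(\theta_0 - \theta) + \bar E\varphi(t,\theta)e_0(t) \\
&= \tilde G(\theta)(\theta_0 - \theta)
\end{aligned} \tag{11A.35}$$

$$\tilde G(\theta) = \bar E\varphi(t,\theta)\tilde\varphi^T(t,\theta) \tag{11A.36}$$

since $\varphi(t,\theta)$ and $e_0(t)$ are independent. With

$$G(\theta) = \bar E\varphi(t,\theta)\varphi^T(t,\theta) \tag{11A.37}$$

the ODE thus is

$$\begin{aligned}
\frac{d}{d\tau}\theta_D(\tau) &= R_D^{-1}(\tau)\tilde G(\theta_D(\tau))\left[\theta_0 - \theta_D(\tau)\right] \\
\frac{d}{d\tau}R_D(\tau) &= G(\theta_D(\tau)) - R_D(\tau)
\end{aligned} \tag{11A.38}$$

Trying the Lyapunov function

$$V(\theta, R) = (\theta - \theta_0)^TR(\theta - \theta_0) \tag{11A.39}$$

for (11A.38) gives

$$\frac{d}{d\tau}V(\theta_D(\tau), R_D(\tau)) = -(\theta - \theta_0)^T\left[\tilde G(\theta) + \tilde G^T(\theta) - G(\theta) + R\right](\theta - \theta_0) \tag{11A.40}$$

Suppose now that the matrix

$$\bar G(\theta) = \tilde G(\theta) + \tilde G^T(\theta) - G(\theta) \tag{11A.41}$$

is positive semidefinite for all $\theta$. Then (11A.40) shows that all trajectories of (11A.38)
end up in

$$D_c = \{\theta \mid \tilde G(\theta)(\theta - \theta_0) = 0\} \tag{11A.42}$$

$\bar G(\theta)$ being positive semidefinite is thus a sufficient condition for convergence of
$\hat\theta(t)$ into $D_c$. In Problem 11A.5, the reader is asked to prove that

$$\operatorname{Re}H_0(e^{i\omega}) > \frac{1}{2}\ \ \forall\omega \;\Rightarrow\; \begin{cases} \bar G(\theta) \ge 0 \quad \forall\theta \\ D_c = \{\theta \mid \varepsilon(t,\theta) = e_0(t)\} \end{cases} \tag{11A.43}$$

This condition on $H_0$ is usually expressed as “$H_0(q) - \frac{1}{2}$ is strictly positive real.” We
can summarize the result as follows:

Result 11.3. Consider the RPLR (11A.1) and (11A.2), with $\eta(t) = \varphi(t)$ and
$\gamma(t) \to 0$ as $t \to \infty$. Assume that the relationship (11A.34) holds between $\varepsilon(t,\theta)$,
$\theta$, and $\varphi(t,\theta)$ and that the transfer function $H_0(q) - \frac{1}{2}$ is strictly positive real. Then

$$\hat\theta(t) \to D_c = \{\theta \mid \varepsilon(t,\theta) = e_0(t)\}$$

with probability 1 as $t \to \infty$.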

Notice that we had to assume that the true system belongs to the model set [6 
appears in (11A.34)]. This was not the case for Result 11.1. 

For the ARMAX model, it can readily be shown that (11A.34) holds with


$$H_0(q) = \frac{1}{C_0(q)}$$

where $C_0(q)$ is the polynomial associated with the noise $C_0(q)e_0(t)$ for the true
system. See Ljung (1977a). The condition on positive realness will thus be


$$\operatorname{Re}\frac{1}{C_0(e^{i\omega})} - \frac12 > 0 \quad \forall\omega \tag{11A.44}$$

Since the condition (11A.44) relates to the true system, it cannot be guaranteed a
priori. With some prior knowledge about the properties of the noise, (11A.44) can,
however, be somewhat relaxed (see Problem 11A.1).
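
Condition (11A.44) is easy to examine numerically for a candidate noise polynomial. The following is a minimal sketch, assuming that $C_0(q) = 1 + c_1q^{-1} + \dots + c_nq^{-n}$ is given by its coefficient list. The function names and the frequency grid are illustrative choices, not part of the text, and a finite grid can only indicate, not prove, that the inequality holds for all $\omega$.

```python
import cmath
import math


def min_real_margin(c_coeffs, n_grid=2000):
    """Smallest value of Re[1/C(e^{iw})] - 1/2 over a grid on [0, pi].

    c_coeffs = [1, c1, ..., cn] holds the coefficients of C(q) in powers
    of q^{-1}. For real coefficients the real part is even in w, so the
    grid on [0, pi] covers [-pi, pi] as well.
    """
    worst = math.inf
    for k in range(n_grid + 1):
        w = math.pi * k / n_grid
        z_inv = cmath.exp(-1j * w)
        c_val = sum(c * z_inv ** j for j, c in enumerate(c_coeffs))
        worst = min(worst, (1.0 / c_val).real - 0.5)
    return worst


def satisfies_11A44(c_coeffs, n_grid=2000):
    """True if the grid check of condition (11A.44) passes."""
    return min_real_margin(c_coeffs, n_grid) > 0.0


if __name__ == "__main__":
    print(satisfies_11A44([1.0, 0.5]))
    print(satisfies_11A44([1.0, 1.5, 0.75]))
```

For $C_0(q) = 1 + 0.5q^{-1}$ the margin is positive at every grid point. For $C_0(q) = 1 + 1.5q^{-1} + 0.75q^{-2}$ it is negative already at $\omega = 0$, since $1/C_0(1) = 1/3.25 < 1/2$, so Result 11.3 does not guarantee convergence in that case.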

When the RPLR algorithm is applied to the output error model (4.25), the
analysis is quite analogous. The condition for convergence is that


$$\operatorname{Re}\frac{1}{F_0(e^{i\omega})} - \frac12 > 0 \quad \forall\omega \tag{11A.45}$$


where $F_0(q)$ is the true denominator polynomial of the system (Problem 11A.2).


Local Convergence of RPLR 


The ODE (11A.38) can be linearized around the desired convergence point $\theta_0$. It can
be verified (Problem 11A.3) that the stability properties of the linearized equation
are entirely determined by the matrix $-G^{-1}(\theta_0)\tilde G(\theta_0)$. If this matrix has
eigenvalues in the right half-plane, then, according to (11A.16), $\hat\theta(t)$ cannot converge
to $\theta_0$. In some special cases, the eigenvalues of this matrix can be explicitly calculated,
and cases can be constructed for which the RPLR estimate $\hat\theta(t)$ cannot converge to its
true value. See Ljung and Söderström (1983), Example 4.11, and Stoica, Holst, and
Söderström (1982).


Conclusions 


In this appendix we have given three basic results relating to the asymptotic properties
of recursive identification algorithms. We should point out that the results have not
been phrased with full rigor and with all technical conditions. See, for example, Ljung
and Söderström (1983) for such an account.
11A PROBLEMS


11A.1 Consider an RPLR for an ARMAX model (the ELS algorithm). Suppose that the
vector $\varphi(t)$ in (11.57) is replaced by $\varphi^*(t) = L(q)\varphi(t)$. Show that the
convergence condition (11.58) is changed to

$$\operatorname{Re}\frac{1}{L(e^{i\omega})C_0(e^{i\omega})} - \frac12 > 0 \quad \forall\omega$$

11A.2 Carry out the convergence analysis for the RPLR method applied to the output error
model (4.25) and show that the condition is given by (11A.45).

11A.3 Consider the ODE (11A.38) and linearize it around $\theta = \theta_0$. Show that the linearized
equation is asymptotically stable if and only if the eigenvalues of the matrix

$$-G^{-1}(\theta_0)\tilde G(\theta_0)$$

are in the left half-plane. [$\tilde G$ and $G$ are defined by (11A.36) and (11A.37).]

11A.4 Linearize the ODE associated with an RPEM around a true parameter value $\theta_0$, and
show that the linearized equation is always stable.

11A.5 Verify (11A.43). (Reference: Ljung, 1977a.)

11A.6 Consider the recursive Gauss-Newton prediction error algorithm (11.44) with a
constant small gain $\gamma(t) = \gamma_0$. Assume that the true system is constant and corresponds
to the parameter vector $\theta_0$. Let $Ee_0^2(t) = E\varepsilon^2(t,\theta_0) = \lambda_0$. Use expression
(11A.29) to show that

$$E\bigl(\hat\theta(t) - \theta_0\bigr)\bigl(\hat\theta(t) - \theta_0\bigr)^T \approx \frac{\gamma_0}{2}\,\lambda_0\bigl[E\,\psi(t,\theta_0)\psi^T(t,\theta_0)\bigr]^{-1}$$

[Hints: Recall (11.62), use the approximation $R^{-1}(t) \approx [ER(t)]^{-1}$, and assume
$\psi(t,\theta_0)$ to be stationary.]
OPTIONS AND OBJECTIVES


The goal of the identification procedure is, in loose terms, to obtain a good and
reliable model with a reasonable amount of work. To this end a number of different
techniques have been developed, as we saw in Part II. They also contain a number of
design variables to be chosen by the user. In this part we shall discuss how to make
these choices so as to achieve our goals. In the present chapter we start by pinpointing
the options that are available to the user and by formalizing the objectives of the
identification exercise. In the latter context we concentrate on linear time-invariant
structures.


12.1 OPTIONS 


When confronted with a process with unknown dynamical properties, the user has
to take a number of decisions, as we discussed in Section 1.4: An experiment has to
be designed, a model structure must be chosen, a criterion of fit must be selected,
and a procedure for validating the obtained model has to be devised. As Figure 1.10
indicates, these choices may also have to be revised a number of times during the
identification procedure.

The options include how to perform the identification experiment, what model
structures to choose, what identification algorithms to apply, and how to validate the
obtained model. We shall collectively refer to all these options or design variables
as


D = {all design variables}


Each of the many choices will have an influence on the quality of the resulting
model. Using the asymptotic theory of Chapters 8 and 9, it is possible to evaluate
the effects, and give advice about suitable choices of D. Indeed, this is the objective
of the theory. This sets the scenario for the chapters to follow.
12.2 OBJECTIVES


What do we mean by “a good and reliable model” and what is “a reasonable amount
of work”? Both issues have a certain amount of subjective flavor, and it is neither possible
nor desirable to give a fully formalized discussion on this topic. We shall, however, in
this section study the quality of a model as related to its intended use. Our discussion
will be confined to the case of linear single-input, single-output systems and models.


The True System and the Model 


To assess the quality of a certain model, it is unavoidable that some assumptions,
implicit or explicit, about the true properties of the process have to be invoked. For
the present purposes we shall suppose that the true system is subject to assumption
S1 of Chapter 8; that is,


$$y(t) = G_0(q)u(t) + H_0(q)e_0(t) \tag{12.1}$$


where $\{e_0(t)\}$ is white noise with variance $\lambda_0$.

Clearly, the realism of such an assumption can be questioned. However, let
us reiterate the point we made in Section 8.1: Analysis pertains to assuming certain
properties of the true data-generation mechanism and subsequently calculating
the resulting properties of the models. Such calculations turn out to be useful and
suggestive, even when the underlying assumption may not be verifiable.

For simpler notation, we shall use, as before,


$$T_0(q) = [G_0(q) \quad H_0(q)] \tag{12.2}$$

Suppose that we have decided on all the design variables $D$, and as a result obtained
the model

$$\hat T(q, D) = [\hat G(q, D) \quad \hat H(q, D)] \tag{12.3}$$


Recall that $D$ contains, among other things, $N$, the number of data, and the model
orders.


Scalar Design Criterion


It is desirable that the model $\hat T(q, D)$ be close to $T_0(q)$. The difference

$$\tilde T(e^{i\omega}, D) = \hat T(e^{i\omega}, D) - T_0(e^{i\omega}) \tag{12.4}$$

should, in other words, be small. Let us develop a formal measure of the size of $\tilde T$.
Depending on the intended use of the model, a good fit in some frequency ranges
may be more important than in others. To capture this fact, we introduce a
frequency-weighted scalar criterion


$$J_1(\tilde T(\cdot, D)) = \int_{-\pi}^{\pi} \tilde T(e^{i\omega}, D)\,C(\omega)\,\tilde T^T(e^{-i\omega}, D)\,d\omega \tag{12.5}$$
where the 2 × 2 matrix function


$$C(\omega) = \begin{bmatrix} C_{11}(\omega) & C_{12}(\omega) \\ C_{21}(\omega) & C_{22}(\omega) \end{bmatrix} \tag{12.6}$$


describes the relative importance of a good fit at different frequencies, as well as the
relative importance of the fit in $G$ and $H$, respectively. We shall generally assume
that $C(\omega)$ is Hermitian; that is,


$$C_{21}(\omega) = \overline{C_{12}(\omega)} \quad [= C_{12}(-\omega)]$$


(The last equality follows when the dependence on $\omega$ is via $e^{i\omega}$.) We shall shortly
give examples of how such weighting functions can be determined.

The scalar $J_1(\tilde T(\cdot, D))$ is a random variable due to the randomness of $\hat T$. To
obtain a realization-independent quality measure, it is natural to take the expectation
of $J_1$ and form the criterion


$$J(D) = \int_{-\pi}^{\pi} E\,\tilde T(e^{i\omega}, D)\,C(\omega)\,\tilde T^T(e^{-i\omega}, D)\,d\omega = \int_{-\pi}^{\pi} \operatorname{tr}\bigl[\Pi(\omega, D)C(\omega)\bigr]\,d\omega \tag{12.7}$$


where the 2 × 2 matrix $\Pi$ is given by

$$\Pi(\omega, D) = E\,\tilde T^T(e^{-i\omega}, D)\,\tilde T(e^{i\omega}, D) \tag{12.8}$$


The problem of choosing design variables can now be stated as

$$\min_{D \in \Delta} J(D) \tag{12.9}$$

where $\Delta$ denotes the constraints associated with our desire to do at most “a reasonable
amount of work.” These will typically include a maximum number of samples,
signal power constraints, not too complex numerical procedures, and so on. The
constraints $\Delta$ could also include that certain design variables simply are not available to
the user in the particular application in question.

The problem (12.9) will be discussed in Chapters 13 to 16. First, we give some
examples of model applications that lead to different functions $C(\omega)$ in (12.7).


Model Applications


In Chapter 3 we listed some typical uses of linear models. They all give rise to
different weighting functions $C(\omega)$ in the objective criterion (12.7).
Example 12.1 Simulation


Suppose that the transfer function $G$ is used to simulate the input-output part of the
system with input $u^*(t)$ as in (3.2). The model $\hat G(q, D)$ then produces the output

$$\hat y_D(t) = \hat G(q, D)u^*(t)$$

while the true system would give the correct output

$$y_0(t) = G_0(q)u^*(t)$$

The error signal

$$\tilde y_D(t) = \hat y_D(t) - y_0(t) = \bigl[\hat G(q, D) - G_0(q)\bigr]u^*(t)$$

has the spectrum

$$\Phi_{\tilde y}(\omega, D) = \bigl|\hat G(e^{i\omega}, D) - G_0(e^{i\omega})\bigr|^2\,\Phi_u^*(\omega) \tag{12.10}$$


where $\Phi_u^*(\omega)$ is the spectrum of $\{u^*(t)\}$. This, again, is a random function, and its
expectation w.r.t. $\hat G$,


$$\Psi_{\tilde y}(\omega, D) = E\bigl|\hat G(e^{i\omega}, D) - G_0(e^{i\omega})\bigr|^2\,\Phi_u^*(\omega) \tag{12.11}$$


is a measure of the average performance degradation due to errors in the model $\hat G$.
Note that, with (12.8) and


$$C(\omega) = \begin{bmatrix} \Phi_u^*(\omega) & 0 \\ 0 & 0 \end{bmatrix} \tag{12.12}$$

we can rewrite (12.11) as

$$\Psi_{\tilde y}(\omega, D) = \operatorname{tr}\,\Pi(\omega, D)C(\omega) \tag{12.13}$$


Finally, the average variance $E\tilde y_D^2(t)$ (averaged over $\{u^*(t)\}$, as well as over $\hat G$) will
be


$$2\pi\,E\tilde y_D^2(t) = J(D) = \int_{-\pi}^{\pi} \Psi_{\tilde y}(\omega, D)\,d\omega \tag{12.14}$$

which is a special case of (12.7). This illustrates how the quadratic design criterion
(12.7) may have an explicit physical interpretation. □
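
For a single model realization, the integral of (12.10) over $-\pi \le \omega \le \pi$ (that is, $J_1$ in (12.5) with the weighting (12.12)) can be evaluated numerically. The sketch below is only an illustration under stated assumptions: the transfer functions are given as coefficient lists in powers of $q^{-1}$, the input spectrum is supplied as a function, and the integral is approximated by a midpoint rule. All names (`freq_resp`, `simulation_criterion`) and the numerical values are hypothetical.

```python
import cmath
import math


def freq_resp(num, den, w):
    """G(e^{iw}) for G(q) = (num[0] + num[1] q^-1 + ...) / (den[0] + den[1] q^-1 + ...)."""
    z_inv = cmath.exp(-1j * w)
    n_val = sum(c * z_inv ** k for k, c in enumerate(num))
    d_val = sum(c * z_inv ** k for k, c in enumerate(den))
    return n_val / d_val


def simulation_criterion(model, true, phi_u, n_grid=4096):
    """Midpoint-rule value of the integral of |G_hat - G_0|^2 * Phi_u over [-pi, pi].

    model and true are (num, den) pairs; phi_u(w) is the spectrum of the
    simulation input u*.
    """
    h = 2.0 * math.pi / n_grid
    total = 0.0
    for k in range(n_grid):
        w = -math.pi + (k + 0.5) * h
        err = freq_resp(*model, w) - freq_resp(*true, w)
        total += abs(err) ** 2 * phi_u(w) * h
    return total


if __name__ == "__main__":
    g0 = ([0.0, 1.0], [1.0, -0.8])      # hypothetical true system
    g_hat = ([0.0, 1.0], [1.0, -0.7])   # hypothetical estimated model
    white = lambda w: 1.0                # white-noise simulation input
    low_pass = lambda w: 1.0 / abs(1.0 - 0.9 * cmath.exp(-1j * w)) ** 2
    print(simulation_criterion(g_hat, g0, white))
    print(simulation_criterion(g_hat, g0, low_pass))
```

The same model error is penalized differently under the two input spectra, which is the role the weighting $C(\omega)$ plays in (12.7).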
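Example 12.2 Prediction


The one-step-ahead prediction is given by (3.20):


$$\hat y(t|t-1) = H^{-1}(q)G(q)u(t) + \bigl[1 - H^{-1}(q)\bigr]y(t) \tag{12.15}$$

when applied to input-output data $\{u^*(t), y^*(t)\}$ generated by the system. The
discrepancy between the prediction $\hat y_D(t|t-1)$ obtained by the model $\hat T(q, D)$ and
the true prediction $\hat y_0(t|t-1)$ thus is, suppressing arguments,


$$\begin{aligned}
\tilde y_D(t|t-1) &= \bigl[\hat H^{-1}\hat G - H_0^{-1}G_0\bigr]u^* + \bigl[H_0^{-1} - \hat H^{-1}\bigr]y^* \\
&= H_0^{-1}(y^* - G_0u^*) - \hat H^{-1}(y^* - \hat Gu^*)
\end{aligned} \tag{12.16}$$

The input-output data obey

$$y^*(t) = G_0(q)u^*(t) + H_0(q)e_0(t) \tag{12.17}$$


which gives


$$\begin{aligned}
\tilde y_D(t|t-1) &= \hat H^{-1}\tilde Gu^* + (1 - \hat H^{-1}H_0)e_0 \\
&= \hat H^{-1}\bigl[\tilde Gu^* + \tilde He_0\bigr] = \hat H^{-1}(q, D)\,\tilde T(q, D)\begin{bmatrix} u^*(t) \\ e_0(t) \end{bmatrix}
\end{aligned} \tag{12.18}$$


This signal has the spectrum


$$\Phi_{\tilde y}(\omega, D) = \frac{1}{\bigl|\hat H(e^{i\omega}, D)\bigr|^2}\,\tilde T(e^{i\omega}, D)\begin{bmatrix} \Phi_u^*(\omega) & \Phi_{ue}^*(\omega) \\ \Phi_{ue}^*(-\omega) & \lambda_0 \end{bmatrix}\tilde T^T(e^{-i\omega}, D)$$


where $\Phi_u^*(\omega)$ is the spectrum of $\{u^*(t)\}$ and $\Phi_{ue}^*(\omega)$ the cross spectrum between
$\{u^*(t)\}$ and $\{e_0(t)\}$. Due to the appearance of $\hat H$ in the denominator, this expression
is not quadratic in the model error. However, assuming the error to be small so that
higher-order terms of $\tilde T$ are neglected, we can replace $\hat H$ by $H_0$. This gives the
(approximate) average spectrum of the error signal


$$\Psi_{\tilde y}(\omega, D) = \operatorname{tr}\,\Pi(\omega, D)C(\omega) \tag{12.19}$$

with


$$C(\omega) = \frac{1}{\bigl|H_0(e^{i\omega})\bigr|^2}\begin{bmatrix} \Phi_u^*(\omega) & \Phi_{ue}^*(\omega) \\ \Phi_{ue}^*(-\omega) & \lambda_0 \end{bmatrix} \tag{12.20}$$


The average variance of the error signal $E\tilde y_D^2(t|t-1)$ is thus approximately given
by the criterion (12.7) with (12.20) for small errors. □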


In this way different intended model uses lead to the criterion (12.7) and (12.8)
with different weighting functions $C(\omega)$. See Problem 12E.1 for a general interpretation
of (12.7) and Problem 12E.3 for its connection to the accuracy of the underlying
parameter estimate.
12.3 BIAS AND VARIANCE


In this section we shall discuss the frequency-domain objective criterion (12.7) and
(12.8) a bit more closely for parametric identification methods. As in (4.111), let


$$T(q, \theta) = [G(q, \theta) \quad H(q, \theta)] \tag{12.21}$$

and let $\hat\theta_N(D)$ be the estimate resulting from one of the methods described in Chapter
7. Here we choose to show the $N$-dependence explicitly. Thus the transfer function
estimate (12.3) is


$$\hat T_N(e^{i\omega}, D) = T(e^{i\omega}, \hat\theta_N(D)) \tag{12.22}$$

Let us develop an expression for the mean-square error (MSE) $\Pi_N(\omega, D)$
in (12.8). According to Chapter 8, $\hat\theta_N(D)$ converges w.p. 1 to a value $\theta^*(D)$. With
$T'(e^{i\omega}, \theta)$ as the $d \times 2$ derivative of $T$ w.r.t. $\theta$, defined in (4.125), the mean-value
theorem gives


$$T(e^{i\omega}, \hat\theta_N(D)) \approx T(e^{i\omega}, \theta^*(D)) + \bigl[\hat\theta_N(D) - \theta^*(D)\bigr]^T T'(e^{i\omega}, \theta^*(D)) \tag{12.23}$$

Introduce the notation


$$B(e^{i\omega}, D) = T(e^{i\omega}, \theta^*(D)) - T_0(e^{i\omega})$$


for the model discrepancy in the limit $N \to \infty$. Then for (12.8) we obtain


$$\Pi_N(\omega, D) \approx B^T(e^{-i\omega}, D)B(e^{i\omega}, D) + \frac{1}{N}P(\omega, D) \tag{12.24}$$

where

$$P(\omega, D) = T'^T(e^{-i\omega}, \theta^*(D))\bigl[N \cdot \operatorname{Cov}\hat\theta_N(D)\bigr]T'(e^{i\omega}, \theta^*(D)) \tag{12.25}$$

In (12.24) we have neglected the term

$$2\operatorname{Re}\,B^T(e^{-i\omega}, D)\bigl[E\hat\theta_N(D) - \theta^*(D)\bigr]^T T'(e^{i\omega}, \theta^*(D))$$


which, in view of (9B.13), will be dominated by either of the terms of (12.24) for large
$N$, under the assumptions of Theorem 9.1. Notice that this theorem also gives us
an expression (9.15) for determining $P(\omega, D)$ explicitly. With (12.24), the design
criterion (12.7) and (12.8) takes the asymptotic form
$$\begin{aligned}
J(D) &\approx J_P(D) + J_B(D) \\
J_P(D) &= \frac{1}{N}\int_{-\pi}^{\pi} \operatorname{tr}\bigl[P(\omega, D)C(\omega)\bigr]\,d\omega \\
J_B(D) &= \int_{-\pi}^{\pi} B(e^{i\omega}, D)\,C(\omega)\,B^T(e^{-i\omega}, D)\,d\omega
\end{aligned} \tag{12.26}$$


Notice that, using (12.25),


$$\begin{aligned}
J_P(D) &= \frac{1}{N}\operatorname{tr}\,P_\theta(D)\bar C(D) \\
\bar C(D) &= \int_{-\pi}^{\pi} T'(e^{i\omega}, \theta^*(D))\,C(\omega)\,T'^T(e^{-i\omega}, \theta^*(D))\,d\omega \\
P_\theta(D) &\approx N \cdot \operatorname{Cov}\hat\theta_N
\end{aligned} \tag{12.27}$$


The expression for $J(D)$ emphasizes the basic feature of a “variance contribution”
$J_P$ and a “bias contribution” $J_B$ to the objective criterion. These two components are
typically affected by the design variables in somewhat different ways. The bias term is
mostly affected by the model set (a large, flexible, and/or well-adapted model set gives
small bias) and is typically unaffected by data record length, signal powers, and so on.
The variance term, on the other hand, typically decreases with increasing amounts
of data and input signal power, while it increases with the number of estimated
parameters.


We shall discuss various aspects of the subproblem

$$\min_{D \in \Delta} J_P(D) \tag{12.28}$$

in more detail in Chapters 13 and 15, while the subproblem

$$\min_{D \in \Delta} J_B(D) \tag{12.29}$$


is dealt with in Section 14.5, where we shall also comment on how to combine the
results on the subproblems to solve


$$\min_{D \in \Delta} J(D) \tag{12.30}$$
Asymptotic Expression for $J_P(D)$


With the expression (9.92) we have, asymptotically in the model order $n$ and the data
record length $N$,


$$\frac{1}{N}P(\omega, D) \approx \frac{n}{N}\,\Phi_v(\omega)\begin{bmatrix} \Phi_u(\omega) & \Phi_{ue}(\omega) \\ \Phi_{ue}(-\omega) & \lambda_0 \end{bmatrix}^{-1} \tag{12.31}$$

Here $n$, $N$, the input spectrum $\Phi_u$, and the cross spectrum $\Phi_{ue}$ are all variables
contained in $D$. On the other hand, these design variables are the only ones that
affect this asymptotic expression for the covariance matrix. There are many important
design variables (such as prefilters, noise models, and prediction horizons) that
do not affect $P(\omega, D)$ in this asymptotic form.

Inserting (12.31) into (12.26) gives the explicit expression


$$J_P(D) \approx \frac{n}{N}\int_{-\pi}^{\pi} \frac{\bigl\{\lambda_0C_{11}(\omega) - 2\operatorname{Re}\bigl[C_{12}(\omega)\Phi_{ue}(-\omega)\bigr] + C_{22}(\omega)\Phi_u(\omega)\bigr\}\,\Phi_v(\omega)}{\lambda_0\Phi_u(\omega) - \bigl|\Phi_{ue}(\omega)\bigr|^2}\,d\omega \tag{12.32}$$

using (12.6).

We shall discuss this expression for the variance contribution in more detail in
Section 13.6.

12.4 SUMMARY 


The many identification methods potentially available for a particular application
can be described as a list of choices and options (“design variables” D).

An ideal route to determining these design variables would be to pose a criterion
for what a “good model” (for the application in question) is and to list the
constraints that are imposed on the design by limited time and cost, as well as the
availability of the system. In that case the “best” identification result can be secured.
We have sketched such a route in this chapter [see (12.26) to (12.31)], and we shall
pursue the choice of design variables in more detail in the following chapters. In practical
application, a less formal attitude will of course be taken, but our formalization
is useful to bring out the character of the considerations that have to be made.


12.5 BIBLIOGRAPHY 


The particular way of describing the available design variables and the formalization
of a mean-square identification objective are further described in Ljung (1985b),
Ljung (1986), and Yuan and Ljung (1985).


12.6 PROBLEMS


12E.1 Let $s(t)$ be a signal derived from a model application. It could, for example, be the
output of a system when a minimum variance regulator, computed using the model, is
applied. Conceptually, we may write

$$s(t) = f(T(q))\,w(t)$$

to denote that the transfer function $T$, as well as some additional signals $w(t)$ (reference
signals and/or noises), are used to determine $s(t)$. Assume that the difference
$\hat T(q, D) - T_0(q)$ is small so that effects higher than second order can be neglected.
Then derive an expression for the expected spectrum of the “performance degradation
signal”

$$\Delta s(t) = \bigl[f(\hat T(q, D)) - f(T_0(q))\bigr]w(t)$$

(see Ljung, 1985b).

12E.2 Assume that the obtained model is going to be used for prediction on input-output
data with the same second-order properties as those used during the identification
experiment (see Example 12.2). Assume also that the asymptotic variance expression
(12.31) is approximately applicable and that the bias contribution can be neglected.
Give an expression for the expected spectrum of the error between the ideal prediction
and the one obtained using the model.

12E.3 Suppose that $\mathcal{S} \in \mathcal{M}$ so that, for some $\theta_0$,

$$T_0(q) = T(q, \theta_0)$$

Then derive an expression for the criterion (12.7) and (12.8) in terms of the
mean-square parameter error

$$\Pi_\theta = E(\hat\theta_N - \theta_0)(\hat\theta_N - \theta_0)^T$$
EXPERIMENT DESIGN


The design of an identification experiment includes several choices, such as which
signals to measure and when to measure them and which signals to manipulate and
how to manipulate them. It also includes some more practical aspects, such as how
to condition the signals before sampling them (choice of presampling filters).

In a sense, the design variables associated with the identification experiment
are more crucial than many of the other variables described in Section 12.1. While
several different design variables associated with models and methods can be tried
out at the computer, the experimental data can be changed only by a new experiment,
which could be a costly and time-consuming procedure. Therefore, it is worthwhile
to design the experiment thoughtfully so as to generate data that are sufficiently
informative.

In this chapter we shall discuss the different choices that concern experiment
design. Some basic principles are discussed in Section 13.1, while the concept of
informative experiments is treated in Section 13.2. Open-loop input design is studied
in Section 13.3. Identifiability issues for closed-loop data are discussed in Section
13.4, while a review of methods for identification of systems operating in closed
loop is given in Section 13.5. Experiment design based on the asymptotic expression
(12.32) is treated in Section 13.6. The choice of sampling interval and sampling filters
is discussed in Section 13.7.


13.1 SOME GENERAL CONSIDERATIONS


Design Variables


When confronted with a physical system whose dynamics is to be identified, there
are a number of questions to be answered. First, the system definition may not be
given: Which signals are to be considered as outputs and which are to be considered
as inputs? This is the question of where in the process the sensors should be placed
(outputs) and which signals should be manipulated (inputs) so as to “excite” the
system during the experiment. It should also be stressed that there may be signals
associated with the process that rightly are to be considered as inputs (in the sense
that they affect the system), even though it is not possible, feasible, or allowed to
manipulate them. If they are measurable, it is then still highly desirable to include
them among the measured input signals and treat them as such when building models,
even though from an operational point of view they should rather be considered as
(measurable) disturbances. See Figure 1.1.

When it has been decided where and what to measure, the next question
is when to measure. Most often the signals are sampled using a constant sampling
interval $T$, and then this quantity has to be chosen.

The choice of input signals has a very substantial influence on the observed
data. The input signals determine the operating point of the system and which parts
and modes of the system are excited during the experiment. The user's freedom in
choosing the input characteristics may vary considerably with the application. In
process industry, it may not be allowed at all to manipulate a system in continuous
production mode. For other systems, such as economic and ecological ones, it is simply
not possible to affect the system for the purpose of an identification experiment.
In laboratory applications and during development phases of new equipment, on
the other hand, the choice of inputs is perhaps not restricted other than by power
limitations.

Two different aspects are associated with the choice of input. One concerns
the second-order properties of $u$, such as its spectrum $\Phi_u(\omega)$ and the cross spectrum
$\Phi_{ue}(\omega)$ between input and driving noise (realized by output feedback). The other
concerns the “shape” of the signal. We can work with inputs being sums of sinusoids,
or filtered white noise, or pseudorandom signals, or binary signals (assuming only
two values), and so on.

As a final choice for the identification experiment, let us list $N$, the number of
input-output measurements to be collected.

Basic Guiding Principles 


Several of these listed choices will be dealt with in more detail in the ensuing sections.
We shall here, however, point to some guiding principles.

Let us denote all the design variables associated with the experiment by $X$.
($X$ is thus a subset of $D$, defined in Section 12.1.) The asymptotic properties of the
resulting estimate can then be described by


$$\theta^*(X) \tag{13.1}$$

the limit to which $\hat\theta_N$ converges, and by

$$P_\theta(X) \tag{13.2}$$

the asymptotic covariance matrix of the parameter estimate (see Chapters 8 and 9).
These expressions can then be translated to other quantities of interest, such as the
resulting transfer-function estimate (see Chapter 12).
Stretching the formal results obtained in Chapters 8 and 9 to more suggestive
formulations, we could say that, for PEMs,


The model $\mathcal{M}(\theta^*(X))$ is the best approximation of the system under the
chosen $X$. (Note that what is the “best approximation” of a system normally
depends on the applied input; see Example 8.2.) (13.3)

$\mathcal{M}(\theta^*(X)) = \mathcal{S}$ if $\mathcal{M}$ is large enough to contain $\mathcal{S}$ and $X$ is such that no
other model is equivalent to the system under $X$. (13.4)

$$P_\theta(X) \sim \lambda_0\left[\bar E\left(\frac{d}{d\theta}\hat y(t|\theta)\right)\left(\frac{d}{d\theta}\hat y(t|\theta)\right)^T\right]^{-1} \tag{13.5}$$


See Theorems 8.2, 8.3, and 9.1 and (9.17), respectively.


Bias

The formulation (13.3) suggests that when the bias may be significant it is wise to let
the experiment resemble the situation under which the model is to be used. This may
of course be difficult to accomplish, since often the objective with identification is to
find out suitable operating conditions. If the true system is suspected to be nonlinear
and a linear model is sought, then the result (13.3) gives the reasonable advice that
the experiment should be carried out around the nominal operating point for the
plant. For a linear system, the issue of bias distribution and how it depends on the
input will be further discussed in Section 14.4.


Informative Experiments


The issue (13.4) relates to the concept of informative data sets defined in Definitions
8.1 and 8.2. Clearly, a primary goal is to design experiments that lead to data by
which we can discriminate between different models in intended model sets. This
problem will be further discussed in Section 13.2. Notice that when $\mathcal{S} \in \mathcal{M}$, then
the issues (13.3) and (13.4) leave the choice of $X$ open within the set of sufficiently
informative ones.


Minimizing Variances 


Once $X$ is chosen so that the limiting model $\theta^*(X)$ is acceptable, but only then,
it becomes interesting to further select $X$ so that the covariance matrix $P_\theta(X)$ is
minimized. Formally, the problem of optimal input design could be stated as


$$\min_{X \in \mathbf{X}} \alpha(P_\theta(X)) \tag{13.6}$$


where $\alpha(P)$ is a scalar measure of how large the matrix $P$ is and $\mathbf{X}$ is a set of
admissible designs subject also to the constraints that (13.3) and (13.4) might impose.
We gave an explicit example of $\alpha(P)$ in (12.27), of the kind


$$\alpha(P) = \operatorname{tr}\,CP$$


The expression (13.5) gives a suggestive hint for the choice of $X$:

A small variance in a certain component of $\theta$ results if the predictor is sensitive
to that component. Hence choose the outputs $y(t)$ and the inputs $u(t)$
so that the predicted output becomes sensitive with respect to parameters that
are important for the application in question. (13.7)


This advice could be used as a general mental picture for experiment design. It
applies to the sensor location problem, the selection of input variables as well as
their characteristics, and to other design issues. The mathematical formalization of
this observation is conceptually straightforward but may be technically involved. In
Sections 13.3 and 13.6 we shall illustrate the formalization for the open-loop input
design problem.


Validation Power 


Another leading principle in experiment design is that the input (the “probing
signal”) should be rich. It should excite the system and force it to show its properties,
even the ones that may be unknown. This desired property may be in conflict with
the bias and variance aspects, though. For example, if we seek a variance-optimal
input to identify a second-order system, the solution may be a signal consisting of
two sinusoids. Such a signal will never reveal if the system would be of higher order:
we cannot invalidate a second-order model. Similarly, an optimal input for a linear
system may be one that assumes only two values, and shifts between those in a certain
fashion. With such an input we can never find out if there is a static nonlinearity at
the input side, since this would simply shift the two levels. All this illustrates that the
input should have validation (and invalidation) power to test possible properties of
interest in the system.
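
The remark about two-level inputs can be made concrete with a few lines of code. The sketch below assumes a hypothetical static input nonlinearity (`saturating`) and hypothetical input levels. It shows that on a two-level input the nonlinearity acts exactly like an affine map, so that its presence cannot be detected from such data, whereas a third input level exposes it.

```python
def affine_equivalent(f, low, high):
    """Return (alpha, beta) such that alpha*u + beta equals f(u) for u in {low, high}."""
    alpha = (f(high) - f(low)) / (high - low)
    beta = f(low) - alpha * low
    return alpha, beta


def saturating(u):
    """Hypothetical static nonlinearity at the input side."""
    return u / (1.0 + abs(u))


# A two-level input switching between -1 and 2 every third sample.
u = [(-1.0 if (t // 3) % 2 else 2.0) for t in range(30)]

alpha, beta = affine_equivalent(saturating, -1.0, 2.0)

# On the two-level input the nonlinearity and the affine map coincide.
max_gap = max(abs(saturating(x) - (alpha * x + beta)) for x in u)

# A third level reveals the difference.
gap_third_level = abs(saturating(0.5) - (alpha * 0.5 + beta))

if __name__ == "__main__":
    print(max_gap, gap_third_level)
```

The gain `alpha` and offset `beta` are absorbed by a linear model (as a changed static gain and operating point), which is the sense in which the nonlinearity "simply shifts the two levels."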


13.2 INFORMATIVE EXPERIMENTS 


In Section 8.2 we introduced the concept of data sets $Z^\infty$ that are “informative
enough” with respect to a model set $\mathcal{M}^*$, meaning that the data allow discrimination
between any two different models in the set. We shall transfer this terminology
to identification experiments by calling an experiment “informative enough” if it
generates a data set that is informative enough.

Clearly, it is a very basic requirement on the design that the experiment should
be informative enough with respect to all model sets that are likely to be used. In
Theorem 8.1 we gave a general result on informative experiments. We shall develop
more detailed and specific characterizations of such experiments in this section.


Open-Loop Experiments 


Consider a yet unspecified model set of single-input, single-output linear models:


$$\mathcal{M}^* = \{G(q, \theta),\ H(q, \theta) \mid \theta \in D_{\mathcal{M}}\} \tag{13.8}$$

Suppose that $\theta_1$ and $\theta_2$ correspond to different models in $\mathcal{M}^*$, let
$\varepsilon_i(t) = \varepsilon(t, \theta_i)$, $G_i(q) = G(q, \theta_i)$, $\Delta G(q) = G_2(q) - G_1(q)$, and $H_i$, $\Delta H$
analogously. Then


$$\Delta\varepsilon(t) = \varepsilon_1(t) - \varepsilon_2(t) = \frac{1}{H_1(q)}\bigl[\Delta G(q)u(t) + \Delta H(q)\varepsilon_2(t)\bigr]$$


Now


$$\varepsilon_2(t) = \frac{1}{H_2(q)}\bigl[(G_0(q) - G_2(q))u(t) + H_0(q)e_0(t)\bigr]$$

where $G_0$, $H_0$ is the true description (8.7) of the system, which need not belong to
(13.8). Suppose that the experiment is carried out in open loop so that $\{u(t)\}$ and
$\{e_0(t)\}$ are independent. Then, using (2.65) and Theorem 2.2,


$$\bar E(\Delta\varepsilon)^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{1}{|H_1(e^{i\omega})|^2}\left[\left|\Delta G(e^{i\omega}) + \frac{G_0(e^{i\omega}) - G_2(e^{i\omega})}{H_2(e^{i\omega})}\,\Delta H(e^{i\omega})\right|^2\Phi_u(\omega) + \bigl|\Delta H(e^{i\omega})\bigr|^2\,\frac{|H_0(e^{i\omega})|^2}{|H_2(e^{i\omega})|^2}\,\lambda_0\right]d\omega \tag{13.9}$$


where $\lambda_0 = Ee_0^2(t)$. According to our standard assumptions on invertibility of the
noise model, $|H_0(e^{i\omega})|^2 > 0$, $\forall\omega$. Suppose now that the data are not informative
with respect to $\mathcal{M}^*$ so that


$$\bar E[\Delta\varepsilon(t)]^2 = 0 \tag{13.10}$$


even though $\Delta G(e^{i\omega})$ and $\Delta H(e^{i\omega})$ are not both identically zero. Equation (13.10)
implies that both the terms within square brackets in (13.9) are identically zero, so


$$\Delta H(e^{i\omega}) \equiv 0$$

which means that the first term takes the form


$$\bigl|\Delta G(e^{i\omega})\bigr|^2\,\Phi_u(\omega) \equiv 0 \tag{13.11}$$


This is the crucial condition on the open-loop input spectrum $\Phi_u(\omega)$, which we
shall develop further. If (13.11) implies that $\Delta G(e^{i\omega}) \equiv 0$, then it follows that
(13.10) implies that the two models are equal, and hence that the data are sufficiently
informative with respect to $\mathcal{M}^*$.

Persistence of Excitation


Inspired by (13.11), we introduce the following concept:


Definition 13.1. A quasi-stationary signal $\{u(t)\}$, with spectrum $\Phi_u(\omega)$, is said to
be persistently exciting of order $n$ if, for all filters of the form


$$M_n(q) = m_1q^{-1} + \cdots + m_nq^{-n} \tag{13.12}$$

the relation


$$\bigl|M_n(e^{i\omega})\bigr|^2\,\Phi_u(\omega) \equiv 0 \quad \text{implies that} \quad M_n(e^{i\omega}) \equiv 0 \tag{13.13}$$
The concept can be given more explicit interpretations. Clearly, the function
$M_n(z)M_n(z^{-1})$ can have at most $n - 1$ different zeros on the unit circle (since one
zero is always at the origin) taking symmetry into account. Hence $u(t)$ is persistently
exciting of order $n$ if $\Phi_u(\omega)$ is different from zero on at least $n$ points in the interval
$-\pi < \omega \le \pi$. This is a direct consequence of the definition. Consider, for example,
an input $u$ consisting of $n$ different sinusoids:


$$u(t) = \sum_{k=1}^{n} \mu_k\cos(\omega_kt), \qquad \omega_k \ne \omega_j,\ k \ne j, \quad \omega_k \ne 0, \quad \omega_k \ne \pi \tag{13.14}$$


According to Example 2.4, each sinusoid gives rise to a spectral line at $\omega_k$ and $-\omega_k$.
This signal is thus persistently exciting of order $2n$. If one of the frequencies equals
0, the order drops to $2n - 1$, since this only gives one spectral line. Similarly, if one
of the frequencies equals $\pi$ (the Nyquist frequency), the order drops by (another)
1.

We also notice that, according to Theorem 2.2, $|M_n(e^{i\omega})|^2\,\Phi_u(\omega)$ is the spectrum
of the signal $v(t) = M_n(q)u(t)$. Hence a signal that is persistently exciting
of order $n$ cannot be filtered to zero by an $(n - 1)$th-order moving-average filter
(13.12).

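Another characterization can be given in terms of the covariance function
$R_u(\tau)$:


Lemma 13.1. Let $u(t)$ be a quasi-stationary signal, and let the $n \times n$ matrix $\bar R_n$
be defined by


$$\bar R_n = \begin{bmatrix} R_u(0) & R_u(1) & \cdots & R_u(n-1) \\ R_u(1) & R_u(0) & \cdots & R_u(n-2) \\ \vdots & \vdots & \ddots & \vdots \\ R_u(n-1) & R_u(n-2) & \cdots & R_u(0) \end{bmatrix} \tag{13.15}$$

Then $u(t)$ is persistently exciting of order $n$ if and only if $\bar R_n$ is nonsingular.

Proof. Let $m = [m_1 \ m_2 \ \ldots \ m_n]^T$. Then $\bar R_n$ is nonsingular if and only if

$$m^T\bar R_nm = 0 \;\Rightarrow\; m = 0 \tag{13.16}$$


It is easy to verify that

$$m^T\bar R_nm = \bar E\bigl[M_n(q)u(t)\bigr]^2$$


with $M_n$ defined by (13.12). Hence

$$m^T\bar R_nm = \frac{1}{2\pi}\int_{-\pi}^{\pi}\bigl|M_n(e^{i\omega})\bigr|^2\,\Phi_u(\omega)\,d\omega$$


according to (2.65) and Theorem 2.2, so (13.16) can be rephrased as

$$\bigl|M_n(e^{i\omega})\bigr|^2\,\Phi_u(\omega) \equiv 0 \;\Rightarrow\; M_n(e^{i\omega}) \equiv 0$$


which is the definition of persistence of excitation. ∎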
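
Lemma 13.1 suggests a simple numerical test: build the Toeplitz covariance matrix and check whether it is nonsingular. The sketch below does this for the sum of sinusoids (13.14), assuming all frequencies lie strictly between 0 and $\pi$ so that each sinusoid $\mu\cos(\omega t)$ contributes $(\mu^2/2)\cos(\omega\tau)$ to the covariance function. The function names and the rank tolerance are illustrative choices; a numerical rank decision always depends on such a tolerance.

```python
import math


def cov_sinusoids(amps, freqs, tau):
    """R_u(tau) for u(t) = sum_k amps[k]*cos(freqs[k]*t), distinct 0 < freqs[k] < pi."""
    return sum(0.5 * a * a * math.cos(w * tau) for a, w in zip(amps, freqs))


def toeplitz_cov(amps, freqs, n):
    """The n x n matrix of (13.15)."""
    return [[cov_sinusoids(amps, freqs, abs(i - j)) for j in range(n)]
            for i in range(n)]


def numerical_rank(matrix, tol=1e-9):
    """Rank by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    rows, cols = len(a), len(a[0])
    rank = 0
    for col in range(cols):
        if rank == rows:
            break
        pivot = max(range(rank, rows), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) <= tol:
            continue
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(rank + 1, rows):
            factor = a[r][col] / a[rank][col]
            for c in range(col, cols):
                a[r][c] -= factor * a[rank][c]
        rank += 1
    return rank


def pe_order(amps, freqs, n_max=12):
    """Largest n <= n_max for which the matrix (13.15) is numerically nonsingular."""
    order = 0
    for n in range(1, n_max + 1):
        if numerical_rank(toeplitz_cov(amps, freqs, n)) == n:
            order = n
        else:
            break
    return order


if __name__ == "__main__":
    print(pe_order([1.0], [1.0]))
    print(pe_order([1.0, 0.5], [0.7, 2.0]))
```

With one sinusoid the test reports order 2 and with two sinusoids order 4, in agreement with the statement that (13.14) is persistently exciting of order $2n$.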
It is useful to consider also a strengthened version of this concept.

Definition 13.2. A quasi-stationary signal $\{u(t)\}$ with spectrum $\Phi_u(\omega)$ is said to
be persistently exciting if


$$\Phi_u(\omega) > 0, \quad \text{for almost all } \omega \tag{13.17}$$


“Almost all” means that the spectrum may be zero on a set of measure zero (like a
countable number of points). Note that a persistently exciting signal cannot lose
this property by standard linear filtering, since a filter that is an analytic function can
have at most a finite number of zeros on the unit circle.


Informative Open-Loop Experiments


With these concepts, it is now easy to characterize sufficiently informative open-loop
experiments. We have the following results.


Theorem 13.1. Consider a set $\mathcal{M}^*$ of SISO models given by (13.8) such that the
transfer functions $G(z, \theta)$ are rational functions:

$$G(q,\theta) = \frac{B(q,\theta)}{F(q,\theta)} = \frac{q^{-n_k}\left(b_1 + b_2 q^{-1} + \cdots + b_{n_b} q^{-n_b+1}\right)}{1 + f_1 q^{-1} + \cdots + f_{n_f} q^{-n_f}}$$

Then an open-loop experiment with an input that is persistently exciting of order
$n_b + n_f$ is sufficiently informative with respect to $\mathcal{M}^*$.


Proof. For two different models, we have

$$\Delta G(q) = \frac{B_1(q)F_2(q) - B_2(q)F_1(q)}{F_1(q)F_2(q)}$$

Hence (13.11) implies that

$$\left|B_1(e^{i\omega})F_2(e^{i\omega}) - B_2(e^{i\omega})F_1(e^{i\omega})\right|^2 \Phi_u(\omega) \equiv 0$$

Since this numerator is a polynomial of degree at most $n_b + n_f - 1$ (we can always
shift $n_k - 1$ steps, so as to conform with (13.12)), it follows from Definition 13.1 that
it is identically zero, and hence $\Delta G(q) \equiv 0$. The theorem now follows from the
discussion following (13.11). □


Corollary. An open-loop experiment is informative if the input is persistently 
exciting. 


Note that the necessary order of persistent excitation equals the number of
parameters to be estimated in this case. If the numerator and denominator of the
model have the same number of parameters $n$, then the input should be persistently
exciting of order $2n$. This means that $\Phi_u(\omega)$ should be nonzero at $2n$ points, which
is achieved for the input (13.14). It is thus sufficient to use $n$ sinusoids to identify an
$n$th order system, a result that ties in nicely with the frequency analysis described in
Section 6.2.

Theorem 13.1 covers, for example, the general model set (4.33) with its several
special cases. It should also be clear that by analogous techniques other structures
can be treated, including multivariable ones. See Problem 13E.2.


Remark. Persistence of excitation of an m-dimensional signal is defined
analogously to Definition 13.1: Let $M_n(q)$ be defined by (13.12) with $m_i$ as $1 \times m$
row-matrices. Then $\{u(t)\}$ is said to be persistently exciting of order $n$ if

$$M_n(e^{i\omega})\,\Phi_u(\omega)\,M_n^T(e^{-i\omega}) \equiv 0 \quad \text{implies that} \quad M_n(e^{i\omega}) \equiv 0 \tag{13.18}$$


Lemma 13.1 has an immediate multivariable counterpart. 


13.3 INPUT DESIGN FOR OPEN LOOP EXPERIMENTS 


The requirement from the previous section that the data should be informative means 
for open loop operation that the input should be persistently exciting (p.e.) of a 
certain order; i.e., that it contains sufficiently many distinct frequencies. This leaves
a substantial amount of freedom for the actual choice, and we shall in this section 
discuss good and typical choices of input signals. 

For the identification of linear systems, there are three basic facts that govern 
the choices: 


1. The asymptotic properties of the estimate (bias and variance) depend only on 
the input spectrum—not the actual waveform of the input. 


2. The input must have limited amplitude: $\underline{u} \le u(t) \le \bar{u}$.


3. Periodic inputs may have certain advantages. 


The first fact follows from (8.71) and (9.54). The second one is obvious from prac- 
tical considerations. The advantages and disadvantages of periodic inputs will be 
discussed later in this section. 


The Crest Factor 


The covariance matrix is typically inversely proportional to the input power. We
would thus like to have as much input power as possible. In practice, the actual
input limitation concerns amplitude constraints $\underline{u}$ and $\bar{u}$. The desired property of
the waveform therefore is defined in terms of the crest factor $C_r$, which for a zero
mean signal is defined as

$$C_r^2 = \frac{\max_t u^2(t)}{\lim_{N\to\infty} \frac{1}{N}\sum_{t=1}^{N} u^2(t)} \tag{13.19}$$

(More sophisticated definitions only use the signal power in a certain frequency band
of interest in the denominator.) A good signal waveform is consequently one that has
a small crest factor. The theoretic lower bound of $C_r$ clearly is 1, which is achieved
for binary, symmetric signals: $u(t) = \pm\bar{u}$.

This gives a theoretical advantage for binary signals, and indeed, several of the
signals that we will discuss in this section will be binary. However, the following
caution should be mentioned: A binary input will not allow validation against
non-linearities. For example, if the true system has a static non-linearity at the input (as
in the Hammerstein model of Figure 5.1) and a binary input is used, the input is still
binary after the non-linearity, and just corresponds to a scaling. There is consequently
no way to detect that such a non-linearity is present from a binary input.
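
As an illustration of the crest factor (13.19) and of the caution about binary inputs, the following minimal sketch computes the crest factor of a finite record and passes a binary signal through a static non-linearity. It is not part of the original text: the finite-sample mean replaces the limit in (13.19), and the cubic function `static_nonlinearity` is a hypothetical example of a Hammerstein-type input non-linearity.

```python
import numpy as np

def crest_factor(u):
    """Finite-record version of (13.19): peak amplitude divided by the RMS value."""
    u = np.asarray(u, dtype=float)
    return np.sqrt(np.max(u**2) / np.mean(u**2))

def static_nonlinearity(u):
    """Hypothetical odd static non-linearity at the input (Hammerstein-type)."""
    return u + 0.5 * u**3

t = np.arange(1000)
binary = np.where(t % 2 == 0, 1.0, -1.0)      # symmetric binary signal, levels +-1
sine = np.cos(2 * np.pi * 7 * t / 1000)       # seven whole periods of a sinusoid

print(crest_factor(binary))   # 1, the theoretical lower bound
print(crest_factor(sine))     # sqrt(2)

# After the non-linearity the binary input is still binary: only a scaling remains.
print(np.allclose(static_nonlinearity(binary), 1.5 * binary))   # True
```

For this particular odd non-linearity the binary input is simply scaled by 1.5, so input-output data cannot distinguish it from a linear gain.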


The Frequency Contents of the Input

Consider, as before, the general SISO model structure

$$y(t) = G(q,\theta)u(t) + H(q,\theta)e(t) \tag{13.20}$$


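The asymptotic covariance matrix that results when a prediction error method is
applied to (13.20) was computed in (9.29) and (9.30):

$$P_\theta(X) = \kappa(\ell)\cdot\left[\bar E\,\psi(t,\theta_0)\psi^T(t,\theta_0)\right]^{-1} \tag{13.21a}$$

$$\kappa(\ell) = \frac{E\left[\ell'(e_0(t))\right]^2}{\left[E\,\ell''(e_0(t))\right]^2} \tag{13.21b}$$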


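From (7.89) we can define the average information matrix per sample, $\bar M$, as

$$\bar M(X) = \lim_{N\to\infty}\frac{1}{N}M_N = \frac{1}{\kappa_0}\,\bar E\,\psi(t,\theta_0)\psi^T(t,\theta_0) \tag{13.22}$$

where

$$\frac{1}{\kappa_0} = \int_{-\infty}^{\infty}\frac{\left[f_e'(x)\right]^2}{f_e(x)}\,dx \tag{13.23}$$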


and $f_e(x)$ is the PDF of the true innovations. The important consequence of these
expressions is that the choice of norm $\ell(\varepsilon)$ and the distribution of the innovations
act only as input-independent scaling of the covariance matrix and the information
matrix. Optimization measures based on $\bar M$ in (13.22) thus cover the Cramér-Rao
lower bound for $P_\theta(X)$ as well as all asymptotic expressions for $P_\theta(X)$ obtained
with prediction error methods, up to an $X$-independent scaling that is immaterial
for the experiment design.

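In (9.54) we gave an expression for the covariance matrix. This can be rewritten
in terms of $\bar M(X)$ as

$$\bar M(X) = \frac{\lambda_0}{2\pi\kappa_0}\int_{-\pi}^{\pi}\left\{G'_\theta(e^{i\omega},\theta_0)\left[G'_\theta(e^{-i\omega},\theta_0)\right]^T\Phi_u(\omega) + \lambda_0\,H'_\theta(e^{i\omega},\theta_0)\left[H'_\theta(e^{-i\omega},\theta_0)\right]^T\right\}\frac{1}{\Phi_v(\omega)}\,d\omega \tag{13.24}$$

provided $u$ and $e_0$ are independent. Here $G'_\theta$ and $H'_\theta$ are the $d \times 1$ gradients of $G$
and $H$. Introduce

$$\tilde M(\omega) = \frac{\lambda_0}{2\pi\kappa_0\,\Phi_v(\omega)}\,G'_\theta(e^{i\omega},\theta_0)\left[G'_\theta(e^{-i\omega},\theta_0)\right]^T \tag{13.25a}$$

$$M_e = \frac{\lambda_0^2}{2\pi\kappa_0}\int_{-\pi}^{\pi}\frac{H'_\theta(e^{i\omega},\theta_0)\left[H'_\theta(e^{-i\omega},\theta_0)\right]^T}{\Phi_v(\omega)}\,d\omega \tag{13.25b}$$

Then we have

$$\bar M(X) = \bar M(\Phi_u) = \int_{-\pi}^{\pi}\tilde M(\omega)\,\Phi_u(\omega)\,d\omega + M_e \tag{13.26}$$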


This expression gives a good impression of how the input spectrum affects the
information matrix in the open loop case. It ties in nicely with the intuitive advice (13.7). To
achieve a large information matrix, we should spend the input power at frequencies
where $\tilde M(\omega)$ is large, that is, where the Bode plot is sensitive to parameter variations
($G'_\theta$ large). Put more leisurely, if a parameter is of special interest, then vary it and
check where the Bode plot moves, and put the input power there. In many cases this
may give sufficient guidance for good input design. In Section 13.6 we shall use a
more formal approach for high order models, to obtain similar results.

Notice also that (13.26) manifests the first fact listed in this section: The in- 
formation matrix/covariance matrix of the parameters depends only on the input 
spectrum—not the particular waveforms. 
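
To make the advice concrete, the sketch below evaluates the information matrix for a single sinusoidal input at a few frequencies. It is an illustration under stated assumptions, not a result from the text: the model is a hypothetical first-order output error structure $G(q,\theta) = bq^{-1}/(1 + fq^{-1})$ with $H = 1$, the innovations are assumed Gaussian with variance `lam0` (so that $\kappa_0 = \lambda_0$ and the noise term $M_e$ vanishes), and the parameter values are arbitrary. For the input $a\cos(\omega_0 t)$ the integral in (13.26) then reduces to $(a^2/2\lambda_0)\,\mathrm{Re}\{g g^H\}$ with $g = G'_\theta(e^{i\omega_0},\theta_0)$.

```python
import numpy as np

def grad_G(w, b, f):
    """Gradient of G = b q^-1 / (1 + f q^-1) w.r.t. (b, f), evaluated at q = e^{iw}."""
    zi = np.exp(-1j * w)                  # q^{-1} on the unit circle
    den = 1.0 + f * zi
    return np.array([zi / den, -b * zi**2 / den**2])

def info_matrix_sinusoid(w0, b, f, amp=1.0, lam0=1.0):
    """Average information matrix for u(t) = amp*cos(w0*t), OE model, Gaussian noise."""
    g = grad_G(w0, b, f)
    return amp**2 / (2.0 * lam0) * np.real(np.outer(g, np.conj(g)))

b, f = 1.0, -0.8                          # hypothetical true parameters (pole at 0.8)
for w0 in (0.05, 0.2, 1.0, 3.0):
    M = info_matrix_sinusoid(w0, b, f)
    print(f"w0 = {w0:4.2f}   det M = {np.linalg.det(M):.3e}")
```

Under these assumptions the determinant is much larger for a sinusoid near the corner frequency of the system (around 0.2 rad/sample) than for one close to the Nyquist frequency, in line with the rule of putting the input power where the Bode plot is sensitive to the parameters.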


Optimal Frequency Contents: It is quite clear that formal optimal design problems
can be formulated from (13.26): Pick a scalar measure of the size of the matrix
$\bar M^{-1}(\Phi_u)$, like its (weighted) trace, its determinant or its matrix norm, and minimize
this with respect to the input spectrum $\Phi_u$. There is quite an extensive literature
on this, e.g., Goodwin and Payne (1977) and Zarrop (1979). One important issue
in this context is how to consider a restricted, finitely parameterized set of inputs
that still covers the set of achievable information matrices. A typical result is that it
is sufficient to consider inputs that are a finite sum of sinusoids, as in (13.14). The
number of necessary sinusoids depends on the system/model order. See, e.g., Stoica
and Söderström (1982a). An efficient algorithm for selecting the frequencies on the
DFT-grid is given in Section 4.3.4 of Schoukens and Pintelon (1991).

It should be noted that the optimal input design will depend on the (unknown)
system, so this optimality approach is worthwhile only when good prior knowledge is
available about the system. The optimum number of sinusoids may also correspond
to the minimum one for identifying a system of this order. To allow for validation
against higher order models, it is thus wise to use an input with more frequencies.
In practice, it is suitable to decide upon an important and interesting frequency
band to identify the system in question, and then select a signal with a more or less
flat spectrum over this band. We will discuss such designs in the remainder of this
section.

Common Input Signals


The basic issue for input signal design is now clear: For linear system identification,
achieve a desired input spectrum for a signal with as small crest factor as possible.
Unfortunately these properties are somewhat in conflict: If it is easy to manipulate
a signal's spectrum, it tends to have a high crest factor and vice versa. We shall now
describe typical choices of waveforms, and how to achieve desired spectra.

A general comment is that it is always advisable to generate the signal and study
its properties, off-line, before using it as an input in an identification experiment.


Filtered Gaussian White Noise. A simple choice is to let the signal be generated
as white Gaussian noise, filtered through a linear filter. With this we can achieve
virtually any signal spectrum (that does not have too narrow pass bands) by proper
choice of filters. Since the signal is generated off-line, non-causal filters can be applied
and transient effects can be eliminated, which gives even better spectral behavior.
See any book on filter design, like Parks and Burrus (1987). The Gaussian signal is
theoretically unbounded, so it has to be saturated ("clipped") at a certain amplitude.
Picking that, e.g., to be at 3 standard deviations gives a crest factor of 3, and at the
same time, on average fewer than 1% of the time points (about 0.3% for an exactly
Gaussian amplitude distribution) are affected. This should lead
to quite minor distortions of the spectrum.
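
The clipping step can be tried out off-line in a few lines. The following sketch is illustrative only: the 5-tap moving average stands in for whatever shaping filter is actually wanted, and the seed and record length are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
e = rng.standard_normal(N)                 # white Gaussian noise

# Illustrative low-pass FIR filter (5-tap moving average); any linear filter could be used.
h = np.ones(5) / 5.0
x = np.convolve(e, h, mode="same")

sigma = np.std(x)
u = np.clip(x, -3.0 * sigma, 3.0 * sigma)  # saturate at 3 standard deviations

fraction_clipped = np.mean(np.abs(x) > 3.0 * sigma)
crest = np.max(np.abs(u)) / np.sqrt(np.mean(u**2))

print(f"fraction of samples clipped: {fraction_clipped:.4f}")
print(f"crest factor after clipping: {crest:.3f}")
```

The crest factor comes out very close to 3 because clipping removes only a small part of the signal power, and the clipped fraction stays well below 1%.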


Random Binary Signal. A random binary signal is a random process which assumes
only two values. It can be generated in a number of different ways. The telegraph
signal is generated as a random process which at any given sample has a certain
probability to change from the current level to the other one. Apparently, the most
common way is to simply generate white, zero mean Gaussian noise, filter it by an
appropriately chosen linear filter, and then just take the sign of the filtered signal. It
can then be adjusted to any desired binary levels. The crest factor is thus the ideal
1. The problem is that taking the sign of the filtered Gaussian signal will change its
spectrum. We therefore do not have full control of shaping the spectrum. In the
off-line situation we can however always check the spectrum of the signal before
using it as input to the process to see if it is acceptable.


Example 13.1 Band-limited Gaussian and Binary Signals 


Suppose we seek an input with power concentrated to the band $1 \le \omega \le 2$ (rad/s).
Let e be generated as white Gaussian noise. Filter this signal through a 5th order
Butterworth filter with the indicated pass band. This gives the signal in the upper plot
of Figure 13.1. Its spectrum is shown in Figure 13.2. Taking the sign of this signal gives
the random binary signal in the lower plot of Figure 13.1. Its spectrum is also shown in
Figure 13.2. The distortion of the desired spectrum is clear. □
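
An experiment in the spirit of Example 13.1 can be sketched as follows. Note the substitution: instead of the 5th order Butterworth filter of the example, this sketch uses an ideal non-causal band-pass filter applied off-line in the frequency domain, which keeps the code self-contained. The seed, the record length and the helper name `out_of_band_fraction` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4096
e = rng.standard_normal(N)                       # white Gaussian noise

# Ideal band-pass filter for 1 <= w <= 2 rad/sample, applied via the FFT (off-line).
w = 2.0 * np.pi * np.fft.rfftfreq(N)             # frequency grid from 0 to pi
in_band = (w >= 1.0) & (w <= 2.0)
x = np.fft.irfft(np.where(in_band, np.fft.rfft(e), 0.0), n=N)   # band-limited Gaussian
u = np.sign(x)                                   # random binary signal

def out_of_band_fraction(s):
    """Share of the periodogram power (positive frequencies) outside the band."""
    P = np.abs(np.fft.rfft(s))**2
    return P[~in_band].sum() / P.sum()

crest_binary = np.max(np.abs(u)) / np.sqrt(np.mean(u**2))
print("out-of-band power, Gaussian:", out_of_band_fraction(x))
print("out-of-band power, binary:  ", out_of_band_fraction(u))
print("crest factor, binary:       ", crest_binary)
```

The binary signal reaches the ideal crest factor, but a noticeable share of its power ends up outside the desired band, which is the distortion the example points to.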


Pseudo-Random Binary Signal, PRBS. A Pseudo-Random Binary Signal is a pe- 
riodic, deterministic signal with white-noise-like properties. It is generated by the 
difference equation 


$$u(t) = \mathrm{rem}(A(q)u(t), 2) = \mathrm{rem}(a_1 u(t-1) + \cdots + a_n u(t-n), 2) \tag{13.27}$$

Here rem(x, 2) is the remainder as x is divided by 2, i.e., the calculations in (13.27)
should be carried out modulo 2. $u(t)$ thus only assumes the values 0 and 1. After u is
generated, we can of course change that to any two levels. The vector of past inputs
$[u(t-1) \; \ldots \; u(t-n)]$ can only assume $2^n$ different values. The sequence u
must thus be periodic with a period of at most $2^n$. In fact, since n consecutive
zeros would make further u's identically zero, we can eliminate that state, and the
maximum period length is $M = 2^n - 1$. Now the actual period of the signal will
depend on the choice of $A(q)$, but it can be shown that for each n there exist choices
of $A(q)$ that give this maximum length. Such choices are shown in Table 13.1 and
the corresponding inputs are called Maximum length PRBS. See Davies (1970).

Figure 13.1 Upper plot: Random Gaussian noise, filtered through a pass-band of
$1 \le \omega \le 2$. Lower plot: Random binary noise obtained by taking the sign of the
signal in the upper plot. Both signals are plotted as piecewise constant signals.

Figure 13.2 Spectra of the signals in Figure 13.1. Solid line: The spectrum of the
Gaussian signal. Dashed line: The spectrum of the binary signal.

TABLE 13.1 A-polynomials that generate maximum length (= M) PRBS for different
orders n. $a_k = 1$ for the indicated k and 0 otherwise. Several other choices may
exist for each n.

Order n    M = 2^n - 1    a_k non-zero for k
 2            3           1, 2
 3            7           2, 3
 4           15           1, 4
 5           31           2, 5
 6           63           1, 6
 7          127           3, 7
 8          255           1, 2, 7, 8
 9          511           4, 9
10         1023           7, 10
11         2047           9, 11

The interest in maximum length PRBS follows from the following property:

Any maximum length PRBS shifting between $\pm\bar{u}$ has the first and second order
properties

$$\frac{1}{M}\sum_{t=1}^{M} u(t) = \frac{\bar{u}}{M}$$

$$R_u(k) = \frac{1}{M}\sum_{t=1}^{M} u(t)u(t+k) = \begin{cases} \bar{u}^2, & k = 0, \pm M, \pm 2M, \ldots \\ -\dfrac{\bar{u}^2}{M}, & \text{else} \end{cases} \tag{13.28}$$

Here $M = 2^n - 1$ is the (maximum length) period, and the summation is performed
with periodic continuation of the signal. Note that the signal does not have exactly
zero mean. Its covariance function thus differs from the second moment function
(13.28). To compute the spectrum, we proceed as in Example 2.3. In the notation of
that example, we find that

$$\Phi_u^p(\omega) = \sum_{k=0}^{M-1} R_u(k)e^{-ik\omega} = \bar{u}^2 - \frac{\bar{u}^2}{M}\sum_{k=1}^{M-1} e^{-ik\omega} = \bar{u}^2\left(1 + \frac{1}{M}\right) - \frac{\bar{u}^2}{M}\cdot\frac{1 - e^{-iM\omega}}{1 - e^{-i\omega}}$$

$$= \begin{cases} \dfrac{\bar{u}^2}{M}, & \text{for } \omega = 0 \\ \bar{u}^2\left(1 + \dfrac{1}{M}\right), & \text{for } \omega = 2\pi k/M, \; k = 1, \ldots, M-1 \end{cases}$$

The expression (2.67) now gives the spectrum

$$\Phi_u(\omega) = \frac{2\pi\bar{u}^2}{M}\sum_{k=1}^{M-1}\delta(\omega - 2\pi k/M), \quad 0 \le \omega < 2\pi \tag{13.29}$$

where we ignored terms proportional to $1/M^2$. In the region $-\pi \le \omega < \pi$ there
will be $M - 1$ frequency peaks ($\omega = 0$ excluded). This shows that maximum length
PRBS behaves like "periodic white noise," and is persistently exciting of order $M - 1$.
Figure 13.3 shows one period of a PRBS and its spectrum.


Figure 13.3 A maximum length PRBS signal. (a) A PRBS with n = 7 and hence M = 127.
(b) The spectrum of the signal, computed by spectral analysis and FFT, respectively.
There are 63 peaks in the FFT spectrum for positive frequencies.


Notice that it is essential to perform these calculations over whole periods. 
Generating just a part of a period of a PRBS will not give a signal with properties 
(13.28). 
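
The recursion (13.27) and the properties (13.28) are easy to check numerically. The sketch below is a minimal illustration, not the author's implementation: the function name `prbs`, its arguments and the all-ones initial state are assumptions, and the taps (3, 7) for n = 7 are taken from Table 13.1.

```python
import numpy as np

def prbs(n, taps, periods=1, levels=(-1.0, 1.0)):
    """PRBS from (13.27): u(t) = sum of u(t-k) over the taps k, computed modulo 2."""
    M = 2**n - 1
    state = [1] * n                       # [u(t-1), ..., u(t-n)]; must not be all zeros
    bits = []
    for _ in range(periods * M):
        new = sum(state[k - 1] for k in taps) % 2
        bits.append(new)
        state = [new] + state[:-1]        # shift the register
    lo, hi = levels
    return np.where(np.array(bits) == 1, hi, lo)

u = prbs(7, taps=(3, 7))                  # one full period, levels +-1
M = len(u)

mean_u = u.mean()
# Second moment function with periodic continuation, as in (13.28)
R = np.array([np.mean(u * np.roll(u, -k)) for k in range(M)])

print(M)                                  # 127
print(abs(mean_u) * M)                    # 1, i.e. the mean has magnitude 1/M
print(R[0], R[1] * M)                     # 1 and -1, i.e. R_u(k) = -1/M for k != 0
```

Evaluating the same averages over only part of a period does not reproduce these values, which is the point of the remark above.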

Like white random binary noise, PRBS has an optimal crest factor. The advantages
and disadvantages of PRBS compared to binary random noise can be summarized
as follows:

• If the PRBS contains whole periods, its covariance matrix will have a very
special pattern according to (13.28). It can be analytically inverted, which will
facilitate certain computations.

• As a deterministic signal, PRBS has its second order properties secured when
evaluated over whole periods. For random signals, one must rely upon the law
of large numbers to have good second order properties for finite samples.

• There is essentially only one PRBS for each choice of $A(q)$. Different initial
values when generating (13.27) only correspond to shifting the sequence. It
is therefore not straightforward to generate mutually uncorrelated sequences
with PRBS. A simple way would be to excite one input at a time, or variants
thereof. See Problem 13E.7.

• For PRBS one should work with an integer number of periods to enjoy its good
properties, which limits the choice of experiment length.


Low-Pass Filtering by Increasing the Clock Period. It is easy to generate binary
signals with white noise second order properties, either as PRBS or by a random
generator. To give the signal a more low-frequency character, we could filter it
through a low pass filter. This would make the signal non-binary, though, with a
worse crest factor. An alternative is to sample faster, i.e., from the given PRBS,
create a new signal u by taking P samples over each sampling period of the original
signal e. The new signal will thus always stay constant over at least P samples. We
have thus increased the sampling frequency to be P times faster than the frequency
at which the PRBS is generated. This is usually expressed as having a clock period
of P. It can be shown (see Example 5.10 in Söderström and Stoica, 1989) that the
new signal u has the same covariance function as

$$u(t) = \frac{1}{\sqrt{P}}\left(e(t) + \cdots + e(t - P + 1)\right)$$

obtained by simple moving average low pass filtering of e.

A typical piece of advice is to let the clock frequency in the PRBS be about 2.5 times the
bandwidth to be covered by the signal (Schoukens, Guillaume, and Pintelon, 1993).
Another is to sample about 10 times faster than the bandwidth to be modeled
(see Section 13.7). Together these suggest that a good choice is to take the clock period
P = 4 (= 10/2.5). A spectrum for a PRBS with P = 4 is shown in Figure 13.4. We see that the
frequencies up to about 1/5 of the Nyquist frequency are well covered.
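
The effect of the clock period on the second order properties can be checked with a short simulation. This is a sketch under assumptions: white random binary noise of unit variance is used in place of a PRBS, and the seed and record length are arbitrary. For such an e, the held signal has the time-averaged covariance $1 - |\tau|/P$ for $|\tau| < P$ and zero otherwise, the same triangular shape that a length-P moving average produces.

```python
import numpy as np

rng = np.random.default_rng(2)
P = 4
e = rng.choice([-1.0, 1.0], size=50_000)   # white binary noise (a PRBS could be used)
u = np.repeat(e, P)                         # hold every value for P samples

def cov(x, tau):
    """Sample estimate of the second moment function at lag tau >= 0."""
    return np.mean(x[tau:] * x[:len(x) - tau])

R_hat = np.array([cov(u, tau) for tau in range(P + 2)])
R_theory = np.array([max(0.0, 1.0 - tau / P) for tau in range(P + 2)])

for tau in range(P + 2):
    print(tau, round(R_hat[tau], 3), R_theory[tau])
```

The signal remains binary, so the crest factor is still 1, while its power is shifted toward low frequencies.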




Figure 13.4 The spectrum of a PRBS signal with P = 4.

Multi-Sines. A natural choice of input is to form it as a sum of sinusoids:

$$u(t) = \sum_{k=1}^{d} a_k\cos(\omega_k t + \phi_k) \tag{13.30}$$

Apart from transient effects this gives a spectrum according to Example 2.4:

$$\Phi_u(\omega) = 2\pi\sum_{k=1}^{d}\frac{a_k^2}{4}\left[\delta(\omega - \omega_k) + \delta(\omega + \omega_k)\right] \tag{13.31}$$

With $d$, $a_k$ and $\omega_k$ we can thus place the signal power very precisely to desired
frequencies. In addition, we mentioned in Section 13.2 that all possible information
matrices can be obtained within the family (13.30) for large enough d. The only
problem with this input is the crest factor. The power of the signal is $\sum a_k^2/2$. If
all sinusoids are in phase, the squared amplitude will be $(\sum a_k)^2$. The crest factor
can thus be up to $\sqrt{2d}$ (if all $a_k$ are equal). The way to control the crest factor is
to choose the phases $\phi_k$ so that the cosines are "as much out of phase" as possible.
A simple solution is the so-called Schroeder phase choice, Schroeder (1970), which
means that the phases are spread as follows when the amplitudes $a_k$ are equal:

$$\phi_1 \text{ arbitrary}, \qquad \phi_k = \phi_1 - \frac{k(k-1)\pi}{d}, \quad 2 \le k \le d \tag{13.32}$$
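
The influence of the phases on the crest factor can be illustrated numerically. The sketch below is not from the original text: it assumes ten equal-amplitude sinusoids placed on the DFT grid of a record of length N = 1024 (roughly covering 1 to 2 rad/sample), and it uses the Schroeder phases $\phi_k = -k(k-1)\pi/d$, i.e. the choice $\phi_1 = 0$. The function names are hypothetical.

```python
import numpy as np

def multisine(freq_bins, phases, N, amp=1.0):
    """Sum of sinusoids (13.30) with frequencies 2*pi*bin/N on the DFT grid."""
    t = np.arange(N)
    w = 2.0 * np.pi * np.asarray(freq_bins) / N
    return np.sum(amp * np.cos(np.outer(t, w) + phases), axis=1)

def crest_factor(u):
    return np.max(np.abs(u)) / np.sqrt(np.mean(u**2))

N, d = 1024, 10
bins = 163 + 18 * np.arange(d)            # equally spaced, about 1.0 to 2.0 rad/sample
k = np.arange(1, d + 1)
schroeder = -k * (k - 1) * np.pi / d      # Schroeder phases with phi_1 = 0

u_zero = multisine(bins, np.zeros(d), N)  # all sinusoids in phase at t = 0
u_schr = multisine(bins, schroeder, N)

print("crest factor, zero phases:     ", crest_factor(u_zero))   # sqrt(2d)
print("crest factor, Schroeder phases:", crest_factor(u_schr))
```

Both signals have the same power, d/2, and the same spectrum; only the waveform, and hence the crest factor, changes.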

Chirp Signals or Swept Sinusoids. A chirp signal is a sinusoid with a frequency
that changes continuously over a certain band $\Omega$: $\omega_1 \le \omega \le \omega_2$ over a certain time
period $0 \le t \le M$:

$$u(t) = A\cos\left(\omega_1 t + (\omega_2 - \omega_1)t^2/(2M)\right) \tag{13.33}$$

The "instantaneous frequency" $\omega_i$ in this signal is obtained by differentiating the
argument w.r.t. time t:

$$\omega_i = \omega_1 + \frac{t}{M}(\omega_2 - \omega_1)$$

and we see that it increases from $\omega_1$ to $\omega_2$. This signal has the same crest factor as
a pure sinusoid, i.e., $\sqrt{2}$, and it gives good control over the excited frequency band.
Due to the sliding frequency, there will however also be power contributions outside
the band $\Omega$.
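
A chirp according to (13.33) can be generated and inspected as follows. This is a minimal sketch with assumed values: the record length M = 2000, the band 1 to 2 rad/sample and unit amplitude are chosen for illustration only, and the in-band power is estimated from the periodogram at positive frequencies.

```python
import numpy as np

M = 2000
w1, w2 = 1.0, 2.0
t = np.arange(M)
u = np.cos(w1 * t + (w2 - w1) * t**2 / (2.0 * M))   # (13.33) with A = 1

crest = np.max(np.abs(u)) / np.sqrt(np.mean(u**2))

w = 2.0 * np.pi * np.fft.rfftfreq(M)                 # frequency grid from 0 to pi
P = np.abs(np.fft.rfft(u))**2
in_band_fraction = P[(w >= w1) & (w <= w2)].sum() / P.sum()

print("crest factor:", crest)                        # close to sqrt(2)
print("share of power inside the band:", in_band_fraction)
```

Most of the power falls inside the swept band, but the share is strictly less than one, which reflects the contributions outside the band mentioned above.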


Example 13.2 Sinusoids and Swept Sinusoids

In Figure 13.5 we show the signal which is obtained from (13.30) with 10 frequencies
of equal amplitude (= 1), equally spread over the frequency band $1 \le \omega \le 2$
rad/sec. The three cases correspond to $\phi_k = 0$, the Schroeder choice (13.32), and
randomized phases, respectively. They all have the same spectrum, shown in Figure
13.5d. Figure 13.6a shows the chirp signal (13.33) over the chosen band, while its
spectrum is shown in Figure 13.6b. □

Figure 13.5 Input signals that are sums of sinusoids. (a) Sum of 10 sinusoids of equal
amplitude over the frequency band $1 \le \omega \le 2$ rad/sec. All phases equal to zero at the
starting time. (b) Same as (a), but with Schroeder phases. (c) Same as (a), but with
random phases. (d) The spectrum of all the signals. Smooth line: estimated with spectral
analysis. Rough line: computed by FFT.


Periodic Inputs 


Some of the signals above are inherently periodic, like the PRBS, or the sum of
sinusoids. All of them can in any case be made periodic by simple repetition. To
retain the nice frequency properties they have been designed for, the following facts
must be taken into account when creating periodic signals:

• The PRBS signal must be generated over one full period, $M = 2^n - 1$, and then
be repeated. This follows from the discussion of its second order properties.

• To create a multi-sine of period M, the frequencies $\omega_k$ in (13.30) must be chosen
from the DFT-grid $\omega_\ell = 2\pi\ell/M$, $\ell = 0, 1, \ldots, M-1$.

• To make the chirp signal (13.33) nicely periodic with period M, $\omega_1$ and $\omega_2$ must
be chosen as $2\pi k_i/M$ for some integers $k_1$ and $k_2$. The signal generated by
(13.33) can then be repeated an arbitrary number of times.

• To display the spectrum of a periodic signal of length N with an even (that is,
integer) number of periods without any leakage, it should be computed for the (DFT)
frequencies $\omega_k = 2\pi k/N$. ("Leakage" means that the Fourier transform is distorted by
boundary effects. Note that the "ringing" in Figures 13.5d and 13.6b is due to
leakage.)

Figure 13.6 A chirp signal covering the frequency band $1 \le \omega \le 2$. (a) Portion of the
signal. (b) The spectrum of the signal, computed by spectral analysis and FFT, respectively.


What are the advantages and disadvantages with periodic inputs? 


• A signal with period M can have at most M distinct frequencies in its spectrum.
It is thus persistently exciting of, at most, order M. In this sense, non-periodic
inputs inject more excitation into the system over a given time span.

• When a periodic input has been applied, say K periods each of length M
(N = KM), it is usually advisable to average the output over the periods, and
work only with one period of input-output data in the model building session.
This gives less data to handle. The signal to noise ratio is improved by a factor
of K by this operation, at the same time as the data record is reduced by the
same factor. No difference in asymptotic properties should thus result from this
(unless the noise model and the dynamics model share parameters). However,
several methods have a performance threshold for finite samples and poor
signal-to-noise ratios, so in practice there might also be an accuracy benefit
from averaging the measurement over the periods.

• A periodic input allows both formal and informal estimates of the noise level
in the system. After transient effects have disappeared, the differences in
the output response over the different periods must be attributed to the noise
sources. This could be quite helpful in the model validation process for the
important distinction between model errors and noise (cf. Chapter 16). More
formally, let the output be

$$y(t) = y_u(t) + v(t)$$

where $y_u(t)$ is the noise-free part that originates from the input. It is thus
periodic and a natural estimate is

$$\hat y_u(t) = \frac{1}{K}\sum_{k=0}^{K-1} y(t + kM), \quad 1 \le t \le M \tag{13.34}$$

and periodically continued for larger t. This gives the noise estimates
$\hat v(t) = y(t) - \hat y_u(t)$ from which both noise levels and noise colors can be estimated.
The noise variance $\lambda_v$, e.g., is estimated as

$$\hat\lambda_v = \frac{1}{(K-1)M}\sum_{t=1}^{KM}\hat v^2(t)$$

• When the models are estimated in terms of Fourier transformed data (see
Section 7.7), periodic signals give no leakage when forming the Fourier transforms.
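
The averaging over periods and the resulting noise estimate can be sketched as follows. Everything specific in this example is an assumption made for illustration: the noise-free periodic response `y_u_period` is a made-up sum of two sinusoids, the noise is white Gaussian with variance `lam_v`, and the variance estimate is normalized by $(K-1)M$ to account for the M averages that are computed from the same data.

```python
import numpy as np

rng = np.random.default_rng(3)
M, K = 500, 20                       # period length and number of periods
lam_v = 0.25                         # true noise variance (known only in simulation)

t = np.arange(M)
y_u_period = np.sin(2 * np.pi * 3 * t / M) + 0.5 * np.cos(2 * np.pi * 11 * t / M)
y = np.tile(y_u_period, K) + np.sqrt(lam_v) * rng.standard_normal(K * M)

# (13.34): average the output over the K periods
y_hat_period = y.reshape(K, M).mean(axis=0)

# Noise estimate and noise variance estimate
v_hat = y - np.tile(y_hat_period, K)
lam_hat = np.sum(v_hat**2) / ((K - 1) * M)

# Remaining error in the averaged output: its variance should be close to lam_v / K
avg_error_var = np.mean((y_hat_period - y_u_period)**2)

print("estimated noise variance:", lam_hat)
print("error variance after averaging:", avg_error_var, "vs lam_v / K =", lam_v / K)
```

The second printout illustrates the improvement of the signal to noise ratio by a factor of K.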


Intersample Behavior of the Input (*) 


So far we have just considered the discrete-time properties of the input $u(t)$,
$t = 1, 2, \ldots$, assuming a unit sampling interval. What will be applied to the actual process
is of course a continuous-time signal $u(t)$ defined for all real $t$ in a certain interval.
To stress this point, we shall in this subsection use the notation $u_k$, $k = 1, 2, \ldots$ for
the input sequence and use $u(t)$ to denote the continuous time input. The spectrum
$\Phi_u^d$ of the sequence $u_k$, $k = 1, 2, \ldots$ is defined by (2.63):

$$\Phi_u^d(\omega) = \sum_{\ell=-\infty}^{\infty} R_\ell^d\,e^{-i\ell\omega} \tag{13.35a}$$

$$R_\ell^d = \lim_{N\to\infty}\frac{1}{N}\sum_{k=1}^{N} u_{k-\ell}\,u_k \tag{13.35b}$$

This is the spectrum we have designed in this chapter, and this is also the spectrum
that determines the model quality as in (13.26). It is only relevant to consider this
spectrum over $|\omega| \le \pi$, since $\Phi_u^d(\omega)$ by definition will be periodic with period $2\pi$.
The spectrum of the continuous time signal is defined analogously as

$$\Phi_u^c(\omega) = \int_{-\infty}^{\infty} R^c(\tau)\,e^{-i\tau\omega}\,d\tau \tag{13.36a}$$

$$R^c(\tau) = \lim_{N\to\infty}\frac{1}{N}\int_{0}^{N} u(t-\tau)\,u(t)\,dt \tag{13.36b}$$

If we choose a sampling interval T, it is natural to construct a continuous time signal
$u(t)$, such that $u(kT) = u_k$, $k = 1, 2, \ldots$. This can be done in several different
ways, depending on how we select the intersample behavior. The simplest case is to
let the input be constant between the sampling instants

$$u(t) = u_k \quad \text{if } kT \le t < (k+1)T \tag{13.37}$$

where T is the sampling interval. Such an input is called zero order hold, ZOH. If
the continuous-time input is defined by linear interpolation between the sampling
instants, we have the first order hold, FOH, case:

$$u(t) = \frac{(t - kT)\,u_{k+1} + (kT + T - t)\,u_k}{T} \quad \text{if } kT \le t < (k+1)T \tag{13.38}$$

The third common choice is to let the continuous time signal be band-limited. This
means that the continuous time signal has no power above the Nyquist frequency,
and that its spectrum coincides with the spectrum of the discrete time signal, up to
this frequency:

$$\Phi_u^c(\omega) = T\,\Phi_u^d(\omega T) \;\text{ for } |\omega| \le \pi/T, \qquad \Phi_u^c(\omega) = 0 \;\text{ else} \tag{13.39}$$

We may think of this signal as obtained as a sum of sinusoids (up to frequency $\pi/T$)
$u(t)$ adjusted so that $u(kT) = u_k$, $k = 1, 2, \ldots$. This is trigonometric interpolation.
Note the following aspects:


• For first and zero order hold, the discrete signal spectrum $T\Phi_u^d(\omega T)$ does not
coincide with the continuous one $\Phi_u^c(\omega)$ even for $|\omega| \le \pi/T$. The reason
is, loosely speaking, that the abrupt changes in the ZOH signal create new
frequencies in the continuous signal.

• In all cases where the continuous time input can be constructed exactly from
its values at the sampling points, it will be possible to form an exact discrete
time model for how $u(kT)$, $k = 1, 2, \ldots$ affect $y(kT)$, $k = 1, 2, \ldots$ (apart
from the noise contributions, of course). The actual discrete time model will
however depend on the intersample behavior, i.e., a model which has a perfect
fit for ZOH input will not describe the system under a FOH input.

• The formulas for translating a discrete time model to continuous time will
depend on the intersample behavior of the input for which the model was
fitted. For a ZOH input we should use the inverse of (4.67)-(4.71).

Note that the methods of this book also allow a direct fit of a continuous
time model to discrete time data. It is just a matter of parameterization as in
(4.65). However, the parameterization must be done using a sampling formula
that is consistent with the true intersample behavior of the input.

• When we build a discrete time model from $u(kT) = u_k$ to $y(kT)$, it is the
discrete signal spectrum $\Phi_u^d$ that determines the model quality (uncertainty) as in
(13.24). This follows from the analysis in Chapters 8 and 9 where the intersample
behavior does not enter. However, as noted above, the translation of the
discrete time model to continuous time should depend on the input intersample
behavior. Therefore, the quality of the resulting continuous-time model
will depend both on the input's discrete signal spectrum and its intersample
properties.
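
The zero order hold (13.37) and first order hold (13.38) constructions can be written down directly. The sketch below is illustrative: the function names are hypothetical, and the samples are indexed from k = 0 rather than k = 1 for convenience.

```python
import numpy as np

def zoh(u_k, T, t):
    """Zero order hold (13.37): u(t) = u_k for kT <= t < (k+1)T."""
    k = np.floor(np.asarray(t) / T).astype(int)
    return np.asarray(u_k)[k]

def foh(u_k, T, t):
    """First order hold (13.38): linear interpolation between sampling instants."""
    grid = T * np.arange(len(u_k))
    return np.interp(t, grid, u_k)

u_k = np.array([0.0, 1.0, -1.0, 2.0])     # samples u_0, ..., u_3 (0-based indexing)
T = 0.5
t = np.array([0.0, 0.25, 0.5, 0.9, 1.25])

print(zoh(u_k, T, t))    # [ 0.   0.   1.   1.  -1. ]
print(foh(u_k, T, t))    # [ 0.   0.5  1.  -0.6  0.5]
```

Both signals agree with the samples at the sampling instants but differ in between, which is why a model fitted under one assumption does not describe the system under the other.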




Chap. 13 Experiment Design 
Which aspects should then guide the choice of input intersample behavior? 


• Affinity to the model use. If the model is to be used for ZOH inputs, as is
typical in all computer controlled applications, it is natural to use a ZOH input
experiment to generate the identification data. Then the resulting model can
be used directly in discrete time.

• Ease and accuracy of input generation. The practical experiment equipment
will of course also decide what is the natural choice of input character. Note
that if the model is to be constructed as, or transformed to, continuous time, it
is necessary that the assumed intersample behavior coincides with the actual
one. Whatever choice is easier to implement accurately is then to be preferred.

Finally, we may note that if the sampling rate is fast compared to the bandwidth
of the system (as will be suggested in Section 13.7), the difference between various
intersample behaviors may be insignificant.


13.4 IDENTIFICATION IN CLOSED LOOP: IDENTIFIABILITY 


It is sometimes necessary to perform the identification experiment under output
feedback, i.e., in closed loop. The reason may be that the plant is unstable, or that it
has to be controlled for production, economic, or safety reasons, or that it contains
inherent feedback mechanisms.

In this section we shall study problems and possibilities with identification data
from closed loop operation. In many cases we will not need to know the feedback
mechanism, but for some of the analytic treatment we shall work with the following
linear output feedback setup: The true system is


$$y(t) = G_0(q)u(t) + v(t) = G_0(q)u(t) + H_0(q)e(t) \tag{13.40a}$$

Here $\{e(t)\}$ is white noise with variance $\lambda_0$. We shall also use the notation
$v(t) = H_0(q)e(t)$. The regulator is as in Figure 13.7:

$$u(t) = r(t) - F_y(q)y(t) \tag{13.40b}$$

Here $\{r(t)\}$ is a reference signal (filtered version of a setpoint, or any other external
signal) that is independent of the noise $\{e(t)\}$. The model is

$$y(t) = G(q,\theta)u(t) + H(q,\theta)e(t) \tag{13.40c}$$

We also assume that the closed loop is well defined in the sense that

Either $F_y(q)$ or both $G(q,\theta)$ and $G_0(q)$ contain a delay (13.40d)

The closed loop system is stable (13.40e)

Figure 13.7 Block diagram of a typical feedback system.


The closed loop equations become

$$y(t) = G_0(q)S_0(q)r(t) + S_0(q)v(t) \tag{13.41a}$$

$$u(t) = S_0(q)r(t) - F_y(q)S_0(q)v(t) \tag{13.41b}$$

where $S_0(q)$ is the sensitivity function

$$S_0(q) = \frac{1}{1 + F_y(q)G_0(q)} \tag{13.42}$$

In the sequel we shall omit arguments $\omega$, $q$, $e^{i\omega}$, and $t$ whenever there is no risk of
confusion.

The input spectrum is

$$\Phi_u = |S_0|^2\Phi_r + |F_y|^2|S_0|^2\Phi_v \tag{13.43}$$

Here $\Phi_r$ and $\Phi_v$ are the spectra of the reference signal and the noise, respectively.
We shall use the notation

$$\Phi_u^r = |S_0|^2\Phi_r, \qquad \Phi_u^e = |F_y|^2|S_0|^2\Phi_v \tag{13.44}$$

to show the two components of the input spectrum, originating from the reference
signal and the noise respectively. See also (8.74).
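
The decomposition of the closed loop input spectrum into a reference part and a noise part can be evaluated on a frequency grid. The following is a minimal sketch in which everything concrete is hypothetical: the first-order plant `G0`, the proportional regulator `Fy`, and the flat spectra `Phi_r` and `Phi_v` are chosen only to give a stable loop that is easy to inspect.

```python
import numpy as np

def closed_loop_input_spectrum(G0, Fy, Phi_r, Phi_v):
    """Reference and noise components of the input spectrum under feedback."""
    S0 = 1.0 / (1.0 + Fy * G0)                     # sensitivity function
    Phi_u_r = np.abs(S0)**2 * Phi_r                # part originating from r
    Phi_u_e = np.abs(Fy)**2 * np.abs(S0)**2 * Phi_v  # part originating from the noise
    return Phi_u_r, Phi_u_e

w = np.linspace(0.01, np.pi, 200)
z_inv = np.exp(-1j * w)
G0 = 0.5 * z_inv / (1.0 - 0.9 * z_inv)   # hypothetical plant with a one-sample delay
Fy = 0.8 * np.ones_like(w)               # hypothetical proportional regulator
Phi_r = np.ones_like(w)                  # white reference, unit variance
Phi_v = 0.1 * np.ones_like(w)            # white noise, variance 0.1

Phi_u_r, Phi_u_e = closed_loop_input_spectrum(G0, Fy, Phi_r, Phi_v)
Phi_u = Phi_u_r + Phi_u_e
print("share of input power due to the reference:",
      Phi_u_r.sum() / Phi_u.sum())
```

With the regulator switched off (`Fy = 0`) the sensitivity function is 1, the noise component vanishes, and the input spectrum equals the reference spectrum, which recovers the open loop case.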
Some Basic Good News


Most of the analytical development in Chapters 8 and 9 was done under general 
conditions that include closed loop data. The basic convergence theorem, Theorem 
8.2, applies also to closed loop data, and Theorem 8.3 tells us that a prediction error 
method will consistently estimate the system if 


e The data is informative (Definition 8.1) 
e The model set contains the true system 


regardless of whether the data {u, y} have been collected under feedback. Since the variance
results of Chapter 9 (Theorem 9.1, equations (9.17) and (9.31)) apply also to closed
loop data, we know that under the above assumptions the straightforward prediction
error estimate will have optimal accuracy.

We therefore need only look into what constitutes informative experiments 
under closed loop, and what the approximation aspects are when the model set does
not contain the true system. 


Some Fallacies with Closed Loop Identification 


There are some fallacies associated with closed loop data: 


• The closed loop experiment may be non-informative even if the input in itself
is persistently exciting. The reason then is that the regulator is too simple. See
Example 13.3 below.

• Spectral analysis applied in a straightforward fashion, as described in Chapter
6, will give erroneous results. According to Problem 6G.1 the estimate of G
will converge to

$$G_*(e^{i\omega}) = \frac{G_0(e^{i\omega})\Phi_r(\omega) - F_y(e^{-i\omega})\Phi_v(\omega)}{\Phi_r(\omega) + |F_y(e^{i\omega})|^2\Phi_v(\omega)}$$

• Correlation analysis, as described by (6.7)-(6.11), will give a biased estimate of
the impulse response, since the assumption $Eu(t)v(t - \tau) = 0$ is violated.

• For open loop data, output error models (see (4.25) and (4.117)) will give
consistent estimates of G, even if the additive noise is not white. This follows
from Theorem 8.4. This is not true for closed loop data. See (13.53) below.

• The subspace method (7.66) will typically not give consistent estimates when
applied to closed loop data.


Example 13.3 Proportional Feedback 


Consider the first-order model structure 

$$y(t) + ay(t-1) = bu(t-1) + e(t) \tag{13.45}$$


and suppose that the system is controlled by a proportional regulator during the 
experiment: 


$$u(t) = -fy(t) \tag{13.46}$$

Inserting the feedback law into the model gives

$$y(t) + (a + bf)y(t-1) = e(t) \tag{13.47}$$


which is the model of the closed-loop system. From this we conclude that all models
$(\hat a, \hat b)$ subject to

$$\hat a = a + \gamma f$$

$$\hat b = b - \gamma$$

with $\gamma$ an arbitrary scalar, give the same input-output description of the system as the
model $(a, b)$ under the feedback (13.46). There is consequently no way to distinguish
between these models. Notice in particular that it is of no help to know the regulator
parameter f. The experimental condition (13.46) is consequently not informative
enough with respect to the model structure (13.45). It is true, though, that the input
signal u(t) is persistently exciting since it consists of filtered white noise. Persistence
of excitation is thus not a sufficient condition on the input in closed-loop experiments.

If the model structure (13.45) is restricted by, for example, constraining b to
be 1:

$$y(t) + ay(t-1) = u(t-1) + e(t) \tag{13.48}$$

then it is clear that the data generated by (13.46) are sufficiently informative to
distinguish between values of the a-parameter. □
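
A small simulation makes the point of Example 13.3 concrete. This is a sketch under assumed values: the parameters `a`, `b`, the gain `f`, the offset `gamma` and the data length are hypothetical and chosen only for illustration. It checks that the two regressors of (13.45) are collinear under (13.46), that every model on the line $\hat a = a + \gamma f$, $\hat b = b - \gamma$ predicts identically, and that fixing b as in (13.48) restores identifiability of a.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, f, N = -0.7, 1.0, 0.5, 2000       # hypothetical true values and regulator gain

y = np.zeros(N)
u = np.zeros(N)
e = rng.standard_normal(N)
for t in range(1, N):
    y[t] = -a * y[t - 1] + b * u[t - 1] + e[t]   # system (13.45)
    u[t] = -f * y[t]                             # feedback (13.46)

# Regressors of (13.45): y(t) = [-y(t-1), u(t-1)] [a, b]^T + e(t)
Phi = np.column_stack([-y[:-1], u[:-1]])
sv = np.linalg.svd(Phi, compute_uv=False)
ratio = sv[1] / sv[0]                   # close to zero: the columns are proportional

def predict(a_m, b_m):
    """One-step prediction of y(t) for the model (a_m, b_m)."""
    return Phi @ np.array([a_m, b_m])

gamma = 0.3
same = np.allclose(predict(a, b), predict(a + gamma * f, b - gamma))

# With b fixed to 1 as in (13.48), a alone can be estimated by least squares.
target = y[1:] - u[:-1]
a_hat = float(-y[:-1] @ target / (y[:-1] @ y[:-1]))
print(ratio, same, a_hat)
```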

Informative Closed Loop Experiments


It could consequently be problematic to obtain relevant information from closed-loop
experiments. Conditions for informative data sets must also involve the feedback
mechanisms. To get a feeling for the problem, consider (8.12) in Definition 8.1.
If (8.12) holds for two different models, then


$$E\left[\Delta W_y(q)y(t) + \Delta W_u(q)u(t)\right]^2 = 0 \tag{13.49}$$


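for some filters $\Delta W_y$ and $\Delta W_u$ that are not both zero and that are of about the same
complexity as the models in the model set. For the regulator (13.40b) this would
mean that

$$0 = E\left[\begin{bmatrix}\Delta W_y & \Delta W_u\end{bmatrix}\begin{bmatrix} G_0 & S_0 \\ 1 & -F_yS_0\end{bmatrix}\begin{bmatrix} S_0r \\ v\end{bmatrix}\right]^2$$

using (13.41). Let

$$W = \begin{bmatrix} W_r & W_v\end{bmatrix} = \begin{bmatrix}\Delta W_y & \Delta W_u\end{bmatrix}\begin{bmatrix} G_0 & S_0 \\ 1 & -F_yS_0\end{bmatrix}$$

The determinant of the last matrix is $-G_0F_yS_0 - S_0 = -1$, so it is always invertible,
which means that $W = 0$ will imply that both $\Delta W_y$ and $\Delta W_u$ are zero. Recall that
r and v are uncorrelated by assumption, so

$$0 = E\left[W_rS_0r\right]^2 + E\left[W_vv\right]^2$$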

The filtered noise v is persistently exciting, and if $S_0r$ is that too, then the conclusion
is that $W = 0$. Under that assumption we have thus shown that (13.49) implies
$\Delta W_y = 0$ and $\Delta W_u = 0$. Note also that $S_0r$ will be persistently exciting if r is, since
the analytic function $S_0$ can be zero at at most finitely many points. We summarize
this result as a theorem.


Theorem 13.2. The closed loop experiment (13.40) is informative if and only if r
is persistently exciting.


As a matter of fact, (13.49) contains more information. This equation essentially
implies that there is a linear, time-invariant, and noise-free relationship between y 
and u: 


$$\Delta W_u(q)u(t) \equiv -\Delta W_y(q)y(t) \tag{13.50}$$


Therefore, only if there is a feedback like (13.50) during the experiment is the data 
set not informative enough. Thus not only external signals like r in the regulator will 
assure an informative experiment, but also nonlinear or time-varying or complex 
(high-order) regulators should, in general, yield experiments that are informative 
enough. This is the most general statement that can be made about informative 
experiments. 

A general result for the multivariable case with time-varying regulators can be 
formulated as follows: 


Let the input signal be given by output feedback plus an extra signal:

$$u(t) = -F_i(q)y(t) + K_i(q)r(t), \qquad i = 1, 2, \ldots, r \tag{13.51}$$

Here $F_i$ and $K_i$ are linear filters that are changed during the experiment between
r (different) ones. The changes are made so that each regulator is used a nonzero
proportion of the total time, and so seldom that any high-frequency contributions to
the signal spectra that arise from the shifts can be neglected. The dimensions of the
filters $F_i$ are $m \times p$ ($m = \dim u$, $p = \dim y$), and of $K_i$ are $m \times s$ ($s = \dim r$).

Assume that the signal r is persistently exciting. Then the experiment is infor- 
mative if and only if 


ia ie [Kilet KI e.) + Filet) F] (e!®)] = ae F;(e'”) sf 
= Ža FJ (e') Psd 

(13.52) 

for almost all frequencies. See Söderström, Ljung, and Gustavsson (1976) for a proof.

An interesting consequence of this result is that, even if no extra input is allowed,

informative experiments result if we shift between different linear regulators. By 


checking (13.52) in the SISO case, we find that it is sufficient to use two regulators

$$u(t) = -F_1(q)y(t) \quad \text{and} \quad u(t) = -F_2(q)y(t)$$

subject to

$$F_1(e^{i\omega}) - F_2(e^{i\omega}) \neq 0, \quad \forall\omega$$

which is a very mild condition. This could be a useful way of achieving informative
experiments when the requirement of good control is stringent.
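
The effect of shifting between two regulators can be tried out on a first-order system of the form (13.45). The sketch below is illustrative only: the parameter values, the two gains (0.9 and 0.2) and the single switch at mid-record are all assumptions. No reference signal is used, yet both parameters can be estimated by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, N = -0.7, 1.0, 20000               # hypothetical true parameters and data length

# Proportional regulator u(t) = -f y(t); the gain is switched once, half way through.
f = np.where(np.arange(N) < N // 2, 0.9, 0.2)

y = np.zeros(N)
u = np.zeros(N)
e = rng.standard_normal(N)
for t in range(1, N):
    y[t] = -a * y[t - 1] + b * u[t - 1] + e[t]
    u[t] = -f[t] * y[t]

# Least squares in the structure y(t) = -a y(t-1) + b u(t-1) + e(t)
Phi = np.column_stack([-y[:-1], u[:-1]])
theta, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
a_hat, b_hat = theta
print(a_hat, b_hat)
```

With a single fixed gain the two regressor columns would be proportional; the switch is what makes the regression matrix well conditioned here.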


Bias Distribution 


We shall now characterize in what sense the model approximates the true system, 
when it cannot be exactly described within the model class. The discussion will 
be based on the frequency domain expressions for the limiting criterion function,
developed in Section 8.5; see (8.76)-(8.77).

Let us focus on the case with a fixed noise model $H(q, \theta) = H_*(q)$. This case
can be extended to the case of independently parameterized G and H, analogously 
to (8.73). Recall that any prefiltering of the data or prediction errors is equivalent 
to changing the noise model. The expressions below therefore contain the case of 
arbitrary prefiltering. For a fixed noise model, only the first term of (8.77) matters in 
the minimization, and we find that the limiting model is obtained as 


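$$G^* = \arg\min_{G}\int_{-\pi}^{\pi}\left|G_0(e^{i\omega}) + B(e^{i\omega}) - G(e^{i\omega},\theta)\right|^2\frac{\Phi_u(\omega)}{|H_*(e^{i\omega})|^2}\,d\omega \tag{13.53a}$$

$$|B(e^{i\omega})|^2 = \frac{\lambda_0}{\Phi_u(\omega)}\cdot\frac{\Phi_u^e(\omega)}{\Phi_u(\omega)}\cdot|H_0(e^{i\omega}) - H_*(e^{i\omega})|^2 \tag{13.53b}$$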


This is identical to the open loop expression (8.71), except for the bias term B. See
also Problem 8G.6. Within the chosen model class, the model G will approximate
the biased transfer function $G_0 + B$ as well as possible, according to the weighted
frequency domain function above. The weighting function $\Phi_u/|H_*|^2$ is the same as
in the open loop case. The major difference is thus that an erroneous noise model
(or unsuitable prefiltering) may cause the model to approximate a biased transfer
function.

Let us comment on the bias function B. From (13.53b) we see that the bias-inclination
will be small in frequency ranges where either (or all) of the following
holds:


• The noise model is good ($H_0 - H_*$ is small)
• The feedback contribution to the input spectrum ($\Phi_u^e/\Phi_u$) is small
• The signal to noise ratio is good ($\lambda_0/\Phi_u$ is small)


In particular, it follows that if a reasonably flexible, independently parameterized
noise model is used, then the bias-inclination of the G-estimate can be negligible.
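
The three conditions above can be explored by evaluating (13.53b) directly. The following sketch uses a hypothetical closed loop (plant, noise filter and regulator gain are assumptions, not taken from the text) and compares a weak and a strong reference signal for an output error model ($H_* = 1$), as well as the case of a correct noise model.

```python
import numpy as np

w = np.linspace(0.01, np.pi, 300)
z = np.exp(-1j * w)

# Hypothetical closed loop, chosen only for illustration
G0 = 0.5 * z / (1.0 - 0.8 * z)
H0 = 1.0 / (1.0 - 0.9 * z)      # true noise filter
Fy = 0.4
lam0 = 1.0

def bias_sq(H_star, phi_r):
    """|B|^2 from (13.53b) for a fixed noise model H_star and reference spectrum phi_r."""
    S0 = 1.0 / (1.0 + Fy * G0)
    phi_v = lam0 * np.abs(H0) ** 2
    phi_u_e = Fy ** 2 * np.abs(S0) ** 2 * phi_v
    phi_u = np.abs(S0) ** 2 * phi_r + phi_u_e
    return (lam0 / phi_u) * (phi_u_e / phi_u) * np.abs(H0 - H_star) ** 2

oe = np.ones_like(z)                       # output error model, H_* = 1
weak = bias_sq(oe, phi_r=0.1)              # little external excitation
strong = bias_sq(oe, phi_r=10.0)           # strong external excitation
exact = bias_sq(H0, phi_r=0.1)             # correct noise model
print(weak.max(), strong.max(), exact.max())
```

In this example the bias term shrinks at every frequency when the reference power grows, and it vanishes when the fixed noise model equals the true one.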


Variance and Information Contents in Closed Loop Data 


Let us now consider the asymptotic variance of the estimated transfer function $\hat G_N$
using the asymptotic black-box theory of Section 9.4. Note that the basic result (9.62) 
applies also to the closed loop case. 
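
From the general expression (9.62) we can directly solve for the upper left
element:

$$\operatorname{Cov}\hat G_N = \frac{n}{N}\Phi_v(\omega)\cdot\frac{\lambda_0}{\lambda_0\Phi_u(\omega) - |\Phi_{ue}(\omega)|^2} \tag{13.54}$$

From (8.75) we easily find that the denominator is equal to $\lambda_0\Phi_u^r(\omega)$, so

$$\operatorname{Cov}\hat G_N = \frac{n}{N}\frac{\Phi_v(\omega)}{|S_0|^2\Phi_r(\omega)} = \frac{n}{N}\frac{\Phi_v(\omega)}{\Phi_u^r(\omega)} \tag{13.55}$$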


The denominator of (13.55) is the spectrum of that part of the input that originates 
from the reference signal r. The open loop expression has the total input spectrum 
here. 
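
The step from (13.54) to (13.55) can be spelled out as follows. This is a short sketch that assumes the closed loop relation (13.41b) with $v = H_0e$, so that $\Phi_v = \lambda_0|H_0|^2$, together with the decomposition (13.43)-(13.44):

```latex
\begin{aligned}
\Phi_{ue} &= -F_y S_0 H_0 \lambda_0, \\
|\Phi_{ue}|^2 &= \lambda_0\,|F_y|^2 |S_0|^2\,\lambda_0 |H_0|^2 = \lambda_0 \Phi_u^e, \\
\lambda_0 \Phi_u - |\Phi_{ue}|^2 &= \lambda_0\left(\Phi_u^r + \Phi_u^e\right) - \lambda_0 \Phi_u^e = \lambda_0 \Phi_u^r .
\end{aligned}
```

The noise-induced part of the input spectrum thus cancels exactly in the denominator of (13.54).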

The expression (13.55) is derived from the covariance matrix obtained for the
maximum likelihood method for Gaussian noise; see Chapter 9. This means that it
is also the asymptotic Cramér-Rao lower limit (for Gaussian noise; see (7.79)), so
it tells us precisely “the value of information” of closed loop experiments. It is the
noise-to-signal ratio (where “signal” is what derives from the injected reference) that 
determines how well the open loop transfer function can be estimated. From this 
perspective, the part of the input that originates from the feedback has no information 
value when estimating G. 

The expression (13.55) also clearly points to a fundamental property in closed
loop identification: The purpose of feedback is to make the sensitivity function $S_0$
small, especially at frequencies with disturbances and poor system knowledge. Feedback
will thus worsen the measured data’s information about the system at these
frequencies.

However, this is not the whole truth. Feedback will also allow us to inject more
input in certain frequency ranges, without increasing the output power. We shall see
in Section 13.6 that for experiment designs that involve output variance constraints it
is always optimal to use closed loop experiments.

Finally, let us stress that the basic result (13.55) is asymptotic when the orders 
of both G and H tend to infinity, as well as N. 


13.5 APPROACHES TO CLOSED LOOP IDENTIFICATION 


As noted in Section 13.4, a directly applied prediction error method—applied as 
if any feedback did not exist—will work well and give optimal accuracy if the true 
system can be described within the chosen model structure (both regarding the noise 
model and the dynamics model). Nevertheless, due to the pitfalls in closed loop 
identification, several alternative methods have been suggested. Although these 
methods, as such, are not experiment design issues, it is natural to discuss them in 
the context of the current chapter. 
One may distinguish between methods that 


1. Assume no knowledge about the nature of the feedback mechanism, and do
not use r even if known. 

2. Assume the signal r and the regulator to be known (and typically of the linear
form (13.40b)). 


3. Assume the regulator to be unknown. Use the measured r to infer information 
about it, and use the estimate of the regulator to recover the system. 


If the regulator indeed has the form (13.40b), there is no major difference between
(1), (2), and (3): This noise-free relationship can be exactly determined based on
a fairly short data record, and then also r carries no further information about the
system, if u is measured. The problem in industrial practice is rather that no regulator
has this simple, linear form: Various delimiters, anti-windup functions and other
nonlinearities will have the input deviate from (13.40b), even if the regulator parameters
(e.g. PID-coefficients) are known. This strongly disfavors the second approach. The
methods correspondingly fall into the following main groups:


1. The Direct Approach: Apply the basic prediction error method (7.12) in a
straightforward manner: use the output y of the process and the input u in the
same way as for open loop operation, ignoring any possible feedback, and not
using the reference signal r.


2. The Indirect Approach: Identify the closed loop system from reference input 
r to output y, and retrieve from that the open loop system, making use of the 
known regulator. 


3. The Joint Input-Output Approach: Consider y and u as outputs of a system
driven by r (if measured) and noise. Recover knowledge of the system and the 
regulator from this joint model. 


We shall in this section treat each of these approaches. 


Direct Identification 


The Direct Identification approach should be seen as the natural approach to closed 
loop data analysis. The main reasons for this are: 


• It works regardless of the complexity of the regulator, and requires no knowledge
about the character of the feedback.


• No special algorithms and software are required.


• Consistency and optimal accuracy are obtained if the model structure contains
the true system (including the noise properties).


• Unstable systems can be handled without problems, as long as the closed loop
system is stable and the predictor is stable. This means that any unstable poles
of G must be shared by H, like in ARX, ARMAX and state-space models (see
(4.23) and (4.91)).


The only drawback with the direct approach is that we will need good noise models. 
In open loop operation we can use output error models (and other models with 
fixed or independently parameterized noise models) to obtain consistent estimates 
(but not of optimal accuracy) of G even when the noise model H is not sufficiently 
flexible. See Theorem 8.4. 

This shows up when a simple model is sought that should approximate the system
dynamics in a pre-specified frequency norm. In open loop we can do so with
the output error method and a fixed prefilter/noise model that matches the specifications.
See (8.71) and Examples 8.5 and 14.2. For closed loop data, a prefilter/noise
model that deviates considerably from the true noise characteristics will introduce
bias, according to (13.53).

All this means that we cannot handle model approximation issues with full control
in the feedback case. However, consistency and optimal accuracy are guaranteed,
just as in the open loop case, if the true system is contained in the model structure.


A natural solution to this would be to first build a higher order model $\hat G$ using
the direct approach, with small bias, and then reduce this model to lower order with
the proper frequency weighting. While many model reduction schemes now exist,
based on balanced realizations and the like, an identification-based way to achieve
it is as follows: First simulate the model with an input u of suitable spectrum, thus
generating noise-free output $\hat y = \hat Gu$. Then subject the input-output data $\hat y$, u to an
output error model of desired complexity. This gives the model

$$G^* = \arg\min_G\int_{-\pi}^{\pi}|\hat G(e^{i\omega}) - G(e^{i\omega})|^2\Phi_u(\omega)\,d\omega \tag{13.56}$$

Note, though, that reduction of unstable models may contain difficulties.


Indirect Identification 


The closed loop system under (13.40b) is 


$$y(t) = G_{cl}(q)r(t) + v_{cl}(t), \qquad G_{cl}(q) = \frac{G_0(q)}{1 + F_y(q)G_0(q)}, \qquad v_{cl}(t) = \frac{1}{1 + F_y(q)G_0(q)}v(t) \tag{13.57}$$


The indirect approach means that $G_{cl}$ is estimated from measured y and r, giving
$\hat G_{cl}$, and then the open loop transfer function estimate $\hat G$ is retrieved from the
equation

$$\hat G_{cl} = \frac{\hat G}{1 + \hat GF_y} \tag{13.58}$$


An advantage with the indirect approach is that any identification method can be
applied to (13.57) to estimate $G_{cl}$, since this is an open loop problem. Therefore
methods like spectral analysis, instrumental variables, and subspace methods, that
may have problems with closed loop data, also can be applied.

The major disadvantage with indirect identification is that any error in $F_y$ (including
deviations from a linear regulator, due to, e.g., input saturations or anti-windup
measures) will be transported directly to $\hat G$.
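
How an error in the assumed regulator carries over to the open loop estimate can be seen by solving (13.58) for the open loop transfer function on a frequency grid. In the sketch below the closed loop response is taken as known exactly, and the plant and the two regulator gains are hypothetical values used only for illustration.

```python
import numpy as np

w = np.linspace(0.01, np.pi, 200)
z = np.exp(-1j * w)

G0 = 0.5 * z / (1.0 - 0.8 * z)           # hypothetical true open loop system
Fy_true = 0.4                            # regulator actually in the loop
G_cl = G0 / (1.0 + Fy_true * G0)         # closed loop response, assumed known exactly

def open_loop_from_closed(G_cl, Fy):
    """Solve (13.58) for the open loop transfer function, given the regulator Fy."""
    return G_cl / (1.0 - Fy * G_cl)

G_exact = open_loop_from_closed(G_cl, Fy_true)   # correct regulator knowledge
G_wrong = open_loop_from_closed(G_cl, 0.5)       # regulator gain assumed 25% too large
err = np.max(np.abs(G_wrong - G0))
print(err)
```

Even with a perfect closed loop model, the mismatch in the assumed gain produces an error in the recovered open loop response, which is largest at low frequencies for this example.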

For methods, like the prediction error method, that allow arbitrary parameterizations
$G_{cl}(q, \theta)$ it is natural to let the parameters $\theta$ relate to properties of the open
loop system G, so that

$$G_{cl}(q,\theta) = \frac{G(q,\theta)}{1 + F_y(q)G(q,\theta)} \tag{13.59}$$

That will make the task to retrieve the open loop system from the closed loop one
more immediate.

We shall now assume that $G_{cl}$ is estimated using a prediction error method
with a fixed noise model/prefilter $H_*$:

$$y(t) = G_{cl}(q,\theta)r(t) + H_*(q)e(t) \tag{13.60}$$


The parameterization can be arbitrary, and we shall comment on it below. It is
quite important to realize that as long as the parameterization describes the same
set of $G_{cl}$, the resulting transfer function $G_{cl}(q, \hat\theta_N)$ will be the same, regardless of
the parameterizations. The choice of parameterization may thus be important for
numerical and algebraic issues, but it does not affect the statistical properties of the
estimated transfer function.

Let us now discuss bias and variance aspects of $\hat G$ estimated from (13.60) and
(13.59). We start with the variance. According to the open loop result (9.63) (which
holds also if the noise is not modelled; see Problem 9G.7), the asymptotic variance
of $\hat G_{cl,N}$ will be

$$\operatorname{Cov}\hat G_{cl,N} = \frac{n}{N}\frac{\Phi_v^{cl}(\omega)}{\Phi_r(\omega)} = \frac{n}{N}\frac{|S_0|^2\Phi_v}{\Phi_r} \tag{13.61}$$

regardless of the noise model $H_*$. Here $\Phi_v^{cl}$ is the spectrum of the additive noise $v_{cl}$
in the closed loop system (13.57), which equals the open loop additive noise, filtered
through the true sensitivity function. To transform this result to the variance of the
open loop transfer function, we use Gauss’ approximation formula (see (9.56)):

$$\operatorname{Cov}\hat G = \frac{dG}{dG_{cl}}\operatorname{Cov}\hat G_{cl}\left(\frac{dG}{dG_{cl}}\right)^* \tag{13.62}$$

It is easy to verify that

$$\frac{dG}{dG_{cl}} = \frac{1}{S_0^2}, \quad \text{so} \quad \operatorname{Cov}\hat G_N = \frac{n}{N}\frac{\Phi_v}{|S_0|^2\Phi_r} = \frac{n}{N}\frac{\Phi_v}{\Phi_u^r}$$

which, not surprisingly, equals what the direct approach gives, (13.55).
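For the bias, we know from (8.71) that the limiting estimate $\theta^*$ is given by (we
write $G_\theta$ as short for $G(e^{i\omega}, \theta)$)

$$\theta^* = \arg\min_\theta\int_{-\pi}^{\pi}\left|\frac{G_0}{1 + F_yG_0} - \frac{G_\theta}{1 + F_yG_\theta}\right|^2\frac{\Phi_r}{|H_*|^2}\,d\omega = \arg\min_\theta\int_{-\pi}^{\pi}\left|\frac{G_0 - G_\theta}{1 + F_yG_\theta}\right|^2\frac{|S_0|^2\Phi_r}{|H_*|^2}\,d\omega$$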

Now, this is no clear cut minimization of the distance $G_0 - G_\theta$. The estimate $\theta^*$ will
be a compromise between making $G_\theta$ close to $G_0$ and making $1/(1 + F_yG_\theta)$ (the
model sensitivity function) small. There will thus be a “bias-pull” towards transfer
functions that give a small sensitivity for the given regulator, but unlike (13.53) it
is not easy to quantify this bias component. However, if the true system can be
represented within the model set, this will always be the minimizing model, so there
is no bias in this case.


Parameterizations. The above results are independent of how the closed loop system
is parameterized. A nice and interesting parameterization for this indirect identification
of closed loop systems has been suggested by Hansen, Franklin, and Kosut
(1989) and Schrama (1991). It is based on the so-called dual Youla-Kucera parameterization.
See Problem 13G.3.


Indirect Identification with Nonlinear, Known Regulator. The indirect technique
can also be applied when the regulator is non-linear, but with considerably more
work: One will have to compute the model output $\hat y(t|\theta) = f(\theta, R, r^t)$ as a function
of the open loop dynamic parameters $\theta$, the known regulator R, and the past
reference signal values $r^t$, and then form an output error criterion.


Joint Input-Output Identification 


Assume that there possibly is a non-measured signal w in the regulator in addition
to r:

$$u(t) = r(t) + w(t) - F_y(q)y(t) \tag{13.63}$$

We assume that w is independent of r and v. The closed loop from v, w and r can
be written similarly to (13.41) as:

$$y = G_0S_0r + S_0v + G_0S_0w = G_{cl}r + v_1 \tag{13.64a}$$

$$u = S_0r - F_yS_0v + S_0w = G_{ru}r + v_2 \tag{13.64b}$$


Identification methods that use models of how both y and u are generated are termed
joint input-output techniques. This leaves a number of variants open, which fall into
the following groups:


1. Allow correlation between $v_1$ and $v_2$, and work with a model:

$$\begin{bmatrix} y \\ u\end{bmatrix} = \mathcal{G}r + \mathcal{H}e \tag{13.65}$$


2. Disregard the correlation in the noise sources and treat (13.64a) and (13.64b) 
as separate models. 


The first approach works also when there is no measurable reference signal r. It
can be shown that this approach in essence is equivalent to the direct approach of
estimating G in (13.40a) and $F_y$ in (13.63). See Problem 13G.4.

The second approach in turn has some variants, but they all have in common
that the system dynamics is estimated as

$$\hat G = \frac{\hat G_{cl}}{\hat G_{ru}} \tag{13.66}$$

where $\hat G_{cl}$ and $\hat G_{ru}$ are estimated from the two open loop systems (13.64). The case
when spectral analysis estimates are used was described and analyzed by Akaike
(1967, 1968). Various parametric approaches have also been suggested.

From (13.64) we see that 


$$G_{cl} = G_0G_{ru} \tag{13.67}$$


so cancellations should take place when (13.66) is formed. This will however not
happen for the estimated models, due to the model uncertainties, so $\hat G$ will then
be of unnecessarily high order. It is thus natural to enforce (13.67) in the model
parameterization:

$$G_{cl}(q,\vartheta) = G(q,\theta)S(q,\eta), \qquad G_{ru}(q,\vartheta) = S(q,\eta), \qquad \vartheta = \begin{bmatrix}\theta \\ \eta\end{bmatrix} \tag{13.68}$$


If we assume $v_1$ and $v_2$ to be independent white noises with variances 1 and $1/\alpha$, we
obtain the following identification criterion for (13.64), (13.68):

$$V_N(\vartheta) = \sum_t|y(t) - G(q,\theta)S(q,\eta)r(t)|^2 + \alpha\sum_t|u(t) - S(q,\eta)r(t)|^2 \tag{13.69}$$

The question still remains how to parameterize G and S. Some possibilities, including
the so-called coprime factor method, are described in Van den Hof et al. (1995b).


The Two-Stage Method. We shall turn to the case where $\alpha \to \infty$ in (13.69). If $\alpha$
is very large, the second sum will dominate the criterion when $\eta$ is determined to
give $\hat S$. Since $\theta$ only enters the first sum, and $\hat S$ is given from the second term, this
procedure is the same as first estimating S in (13.64b) and then using

$$\hat u(t) = S(q,\hat\eta)r(t) \tag{13.70}$$

in

$$y(t) = G(q,\theta)\hat u(t) + v(t) \tag{13.71}$$


to estimate G. This is the two-stage method suggested by Van den Hof and Schrama
(1993). A variant with a non-causal S is described in Forssell and Ljung (1998d). Let
us analyze the properties of the latter variant: Suppose $S(q, \eta)$ is parameterized as a
non-causal FIR filter:

$$S(q,\eta) = \sum_{k=-M}^{M}s_kq^{-k}$$


and take M so large that any correlation between $u(t)$ and $r(s)$, $|s - t| > M$, can
be ignored. The model

$$u(t) = S(q,\eta)r(t) + \tilde u(t)$$

can then be estimated using the least squares method giving

$$\hat u(t) = S(q,\hat\eta_N)r(t)$$

such that the sequence $\tilde u = u - \hat u$ is uncorrelated with the sequence r. Notice that
this holds irrespectively of the true relationship between r and u, which very well may
be non-linear! (In this latter case $\tilde u$ and r will be dependent, but still uncorrelated.)
Suppose that the true system is given by (13.40a). Then, inserting $\hat u$ gives

$$y = G_0\hat u + v + G_0\tilde u$$

where the “noise” sequence $v + G_0\tilde u$ is uncorrelated with the “input” sequence $\hat u$.
Suppose now that we estimate G from (13.71) with a fixed noise model $H_*$. From
(8.71), which is valid if the noise and input are uncorrelated, we then have the model
converging to $G^* = G(\theta^*)$ where

$$\theta^* = \arg\min_\theta\int_{-\pi}^{\pi}|G_0(e^{i\omega}) - G(e^{i\omega},\theta)|^2\frac{\Phi_{\hat u}(\omega)}{|H_*(e^{i\omega})|^2}\,d\omega \tag{13.72}$$

Here the spectrum $\Phi_{\hat u}$ is fixed and known to us. So, if $\hat u$ is persistently exciting, we
can consistently estimate $G_0$ by using a large enough model structure $G_\theta$. In any
case we can achieve an approximation to $G_0$ in a known frequency norm $\Phi_{\hat u}/|H_*|^2$
that we can affect by a proper choice of noise model (prefilter). We have thus gained
something over the direct identification method, which gives a possible bias as in
(13.53). The price is the increased variance caused by the extra “noise” $G_0\tilde u$. Note
again that this result is valid even if the regulator is non-linear.
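
The two stages can be sketched in a few lines of NumPy. Everything specific below is an assumption made for illustration: the first-order plant, the coloured disturbance, the proportional regulator, the FIR lengths `M` and `n`, and the use of a plain FIR structure for G in the second stage (which makes both stages linear least squares problems).

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, n = 5000, 20, 30          # data length, FIR half-width for S, FIR length for G

r = rng.standard_normal(N)      # reference signal
e = 0.3 * rng.standard_normal(N)
x = np.zeros(N)                 # noise-free output of G0
v = np.zeros(N)                 # coloured disturbance
y = np.zeros(N)
u = np.zeros(N)
u[0] = r[0]
for t in range(1, N):
    x[t] = 0.8 * x[t - 1] + 0.5 * u[t - 1]   # G0 = 0.5 q^-1 / (1 - 0.8 q^-1)
    v[t] = 0.9 * v[t - 1] + e[t]
    y[t] = x[t] + v[t]
    u[t] = r[t] - 0.4 * y[t]                 # regulator of the form (13.40b)

# Stage 1: non-causal FIR model from r to u, estimated by least squares
Phi_r = np.column_stack([r[M - k : N - M - k] for k in range(-M, M + 1)])
eta, *_ = np.linalg.lstsq(Phi_r, u[M : N - M], rcond=None)
u_hat = Phi_r @ eta
u_til = u[M : N - M] - u_hat                 # residual, orthogonal to the r-regressors

# Stage 2: FIR model from u_hat to y, again by least squares
L = u_hat.size
Phi_u = np.column_stack([u_hat[n - j : L - j] for j in range(1, n + 1)])
g_hat, *_ = np.linalg.lstsq(Phi_u, y[M : N - M][n:], rcond=None)

g_true = 0.5 * 0.8 ** np.arange(n)           # impulse response of G0 at lags 1..n
print(np.round(g_hat[:5], 3), g_true[:5])
```

The first stage never uses knowledge of the regulator, which is why the same recipe applies when the feedback is non-linear.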


Summarizing Remarks 


We may summarize the basic issues on closed loop identification as follows: 


• The basic problem with closed loop data is that it typically has less information
about the open loop system: an important purpose of feedback is to make the
closed loop system less sensitive to changes in the open loop system.


• Prediction error methods, applied in a direct fashion, with a noise model that
can describe the true noise properties still give consistent estimates and optimal
accuracy. No knowledge of the feedback is required. This should be regarded
as a prime choice of methods.


• Several methods that give consistent estimates for open loop data may fail when
applied in a direct way to closed loop identification. This includes spectral and
correlation analysis, the instrumental variable method, the subspace methods,
and output error methods with incorrect noise model.


• If the regulator mechanism is correctly known, indirect identification can be
applied. Its basic advantage is that the dynamics model G can be correctly estimated
without estimating any noise model, even when $G_0$ is unstable. However,
any error in the assumed regulator will directly cause a corresponding
error in the estimate of G. Since most regulators contain non-linearities, this
means that indirect identification has fallacies.


• The joint input-output approach in the two-stage variant offers the advantage
that model approximation in a known and user-chosen frequency weighting
norm can be achieved (see (13.72)) at the price of higher variance.

13.6 OPTIMAL EXPERIMENT DESIGN FOR HIGH-ORDER BLACK-BOX
MODELS


In this section we shall use the asymptotic variance expression (9.62) to design optimal 
experiments. In Chapter 12 we derived a general design criterion based on this
expression, (12.32):


$$J_P(X) = \int_{-\pi}^{\pi}\Psi(\omega, X)\,d\omega \tag{13.73a}$$

$$\Psi(\omega, X) = \frac{\lambda_0C_{11}(\omega) - 2\operatorname{Re}[C_{12}(\omega)\Phi_{ue}(-\omega)] + C_{22}(\omega)\Phi_u(\omega)}{\lambda_0\Phi_u(\omega) - |\Phi_{ue}(\omega)|^2}\cdot\Phi_v(\omega) \tag{13.73b}$$

Here, we dispensed with the scaling n/N, which is immaterial for the choice of
experimental condition X. Recall that the variance expression is asymptotic in the
model order.

For design variables, we can work with different equivalent setups. An immediate
choice is to design the input spectrum and the cross spectrum:

$$X = \{\Phi_u, \Phi_{ue}\} \tag{13.74}$$


A more explicit way is to work directly with the regulator and the reference signal
spectrum:

$$u(t) = -F_y(q)y(t) + r(t)$$

and regard $X = \{F_y, \Phi_r\}$ as the design variables. In this case we have

$$\lambda_0\Phi_u(\omega) - |\Phi_{ue}(\omega)|^2 = \lambda_0\Phi_u^r(\omega)$$

where $\Phi_u^r$ is that part of the input spectrum that originates from the reference signal
(see (13.44)).


Criterion Involving G Only


We shall first consider the case where $C_{12} = C_{22} = 0$, that is, the design criterion
involves only the dynamics part G:

$$J(X) = \int_{-\pi}^{\pi}\operatorname{Cov}\hat G_N(e^{i\omega})C_{11}(\omega)\,d\omega \sim \int_{-\pi}^{\pi}\frac{\Phi_v(\omega)C_{11}(\omega)}{\Phi_u^r(\omega)}\,d\omega \tag{13.75}$$

This is no doubt the most common special case, where the quality of the noise model
is of less importance.


We shall minimize this design criterion, subject to constrained variance of the
input and the output in the general form

$$\alpha Eu^2 + \beta Ey^2 \le 1 \tag{13.76}$$

Here, we assume that $\alpha$ and $\beta$ are chosen so that it is at all possible to achieve
this constraint with the given disturbances. The solution is given by the following
theorem.

Theorem 13.3. Consider the problem to minimize J in (13.75) with respect to the
design variables $X = \{F_y, \Phi_r\}$ under the constraint (13.76). This is achieved by
selecting the regulator $u(t) = -F_y(q)y(t)$ that solves the standard LQG problem,

$$F_y^{opt} = \arg\min_{F_y}\left(\alpha Eu^2 + \beta Ey^2\right), \qquad y = G_0u + H_0e \tag{13.77}$$

The reference signal spectrum shall be chosen as

$$\Phi_r^{opt}(\omega) = \mu\sqrt{\Phi_v(\omega)C_{11}(\omega)}\,\frac{|1 + G_0(e^{i\omega})F_y^{opt}(e^{i\omega})|^2}{\sqrt{\alpha + \beta|G_0(e^{i\omega})|^2}} \tag{13.78}$$

where $\mu$ is a constant, adjusted so that equality is met in (13.76).


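Proof. We first establish the following straightforward result. For two positive functions
$A(t)$ and $X(t)$ we have that

$$\text{the integral } \int\frac{A(t)}{X(t)}\,dt \text{ with constraint } \int X(t)\,dt \le K \text{ is minimized by } X(t) = \mu\sqrt{A(t)}, \quad \mu = \frac{K}{\int\sqrt{A(t)}\,dt} \tag{13.79}$$

To prove this we have by Schwarz’s inequality

$$\left[\int\sqrt{A(t)}\,dt\right]^2 = \left[\int\sqrt{\frac{A(t)}{X(t)}}\sqrt{X(t)}\,dt\right]^2 \le \int\frac{A(t)}{X(t)}\,dt\int X(t)\,dt \le K\int\frac{A(t)}{X(t)}\,dt$$

or

$$\int\frac{A(t)}{X(t)}\,dt \ge \frac{1}{K}\left[\int\sqrt{A(t)}\,dt\right]^2$$

with equality for $X(t) = \mu\sqrt{A(t)}$, which gives (13.79).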
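For the closed loop we have (see (13.42)-(13.44) and (8.75))

$$\begin{aligned}
S_0(\omega) &= \frac{1}{1 + G_0F_y}, \qquad \Phi_u(\omega) = |S_0|^2\Phi_r(\omega) + |F_y|^2|S_0|^2\Phi_v(\omega) \\
\Phi_y(\omega) &= |G_0|^2|S_0|^2\Phi_r(\omega) + |S_0|^2\Phi_v(\omega) \\
\Phi_u^r &= |S_0|^2\Phi_r, \qquad \Phi_u^e = |F_y|^2|S_0|^2\Phi_v, \qquad |\Phi_{ue}(\omega)|^2 = \lambda_0\Phi_u^e(\omega)
\end{aligned} \tag{13.80}$$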
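
This means that it is convenient to use

$$X = \{\Phi_u^r, F_y\} \tag{13.81}$$

as design variables. With this, we can rewrite the constraint (13.76) as

$$1 \ge \int_{-\pi}^{\pi}\left[\alpha\Phi_u(\omega) + \beta\Phi_y(\omega)\right]d\omega = \int_{-\pi}^{\pi}\left[(\alpha + \beta|G_0|^2)\Phi_u^r + \frac{\alpha|F_y|^2 + \beta}{|1 + G_0F_y|^2}\Phi_v\right]d\omega$$

and the criterion is

$$\min_{F_y,\,\Phi_u^r}\int_{-\pi}^{\pi}\frac{\Phi_v}{\Phi_u^r}C_{11}\,d\omega$$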

The criterion does not depend explicitly on $F_y$, and is a question of making $\Phi_u^r$ as
large as allowed. From the constraint we see that this means that $F_y$ should be chosen
by solving the LQ problem

$$\min_{F_y}\int_{-\pi}^{\pi}\frac{\alpha|F_y|^2 + \beta}{|1 + G_0F_y|^2}\Phi_v\,d\omega = \min_{F_y}E(\alpha u^2 + \beta y^2)$$

for $y(t) = G_0(q)u(t) + v(t)$, $u(t) = -F_y(q)y(t)$.

Define the constant $\gamma$ as

$$\gamma = 1 - \int_{-\pi}^{\pi}\frac{\alpha|F_y^{opt}|^2 + \beta}{|1 + G_0F_y^{opt}|^2}\Phi_v\,d\omega$$

which is assumed to be positive.
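The minimization problem now reads

$$\min_{\Phi_u^r}\int_{-\pi}^{\pi}\frac{\Phi_vC_{11}}{\Phi_u^r}\,d\omega, \quad \text{with} \quad \int_{-\pi}^{\pi}(\alpha + \beta|G_0|^2)\Phi_u^r\,d\omega \le \gamma$$

In (13.79) we take $X = (\alpha + \beta|G_0|^2)\Phi_u^r$ and $A = \Phi_vC_{11}(\alpha + \beta|G_0|^2)$, which gives

$$\Phi_u^{r,opt} = \mu\sqrt{\frac{\Phi_vC_{11}}{\alpha + \beta|G_0|^2}}$$

which, via (13.80), concludes the proof. □
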
The theorem tells us a number of useful things: 


• The optimal experiment design depends on the (unknown) true system and
noise characteristics. This is the normal situation for optimality results, and in
practice it has to be handled by using the best prior information available about
the system.

• If we have a pure input power constraint, $Eu^2 \le K$, and the system is stable,
then it is always optimal to use open loop operation, $F_y = 0$. The input
spectrum is then proportional to $\sqrt{\Phi_vC_{11}}$, which shows that the input power
should be spent in frequency bands where there is substantial noise ($\Phi_v$ large)
and/or where a good model is particularly important ($C_{11}$ large).


• If the constraint involves any limitation on the output variance ($\beta > 0$), then it
is always optimal to use closed loop experiments. In these cases the regulator
does not depend on the criterion function $C_{11}$, but only on the constraint and the
system.
Input Variance Constraint Only 


Consider now the situation where also the noise model quality enters the criterion
($C_{22}(\omega) > 0$), but there is no cross term ($C_{12} = C_{21} = 0$). Suppose also that the
constraint involves the input variance only. That is, we have

$$\min_X\int_{-\pi}^{\pi}\frac{\lambda_0C_{11}(\omega) + C_{22}(\omega)\Phi_u(\omega)}{\lambda_0\Phi_u(\omega) - |\Phi_{ue}(\omega)|^2}\Phi_v(\omega)\,d\omega \tag{13.82a}$$

$$X = \{\Phi_u, \Phi_{ue}\} \tag{13.82b}$$

$$Eu^2 \le K \tag{13.82c}$$

Since the design variable $\Phi_{ue}$ does not enter the constraint, we realize immediately
that the optimum choice of this variable is $\Phi_{ue} = 0$, since this minimizes the integrand
in the criterion pointwise. This means that open loop operation is optimal. This, in
turn, means that the variance of H is not affected by the design. In other words, the
integrand reads

$$\frac{\Phi_vC_{11}}{\Phi_u} + \frac{\Phi_vC_{22}}{\lambda_0}$$

so this case is solved by Theorem 13.3: The optimal design is an open loop experiment,
with an input with spectrum

$$\Phi_u^{opt} = \mu\sqrt{\Phi_vC_{11}} \tag{13.83}$$

where $\mu$ is adjusted so that the input power constraint is met.
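
The open loop design (13.83) can be checked numerically against a white input of the same power. The sketch below rests on several assumptions that are not in the text: a hypothetical noise spectrum and weighting $C_{11}$, a trapezoidal frequency grid, and the normalization $Eu^2 = \frac{1}{2\pi}\int\Phi_u\,d\omega$ for the input power.

```python
import numpy as np

w = np.linspace(-np.pi, np.pi, 2001)
dw = np.diff(w)
c = np.zeros_like(w)                    # trapezoidal quadrature weights
c[:-1] += 0.5 * dw
c[1:] += 0.5 * dw

# Hypothetical noise spectrum and weighting, for illustration only
phi_v = 1.0 / np.abs(1.0 - 0.9 * np.exp(-1j * w)) ** 2
C11 = np.where(np.abs(w) < 1.0, 1.0, 0.01)
K = 1.0                                 # input power budget, E u^2 <= K

def power(phi_u):
    """E u^2 = (1/2pi) * integral of the spectrum (assumed normalization)."""
    return np.sum(c * phi_u) / (2.0 * np.pi)

def criterion(phi_u):
    """Integral of phi_v * C11 / phi_u, the open loop form of (13.75)."""
    return np.sum(c * phi_v * C11 / phi_u)

shape = np.sqrt(phi_v * C11)
phi_opt = shape * K / power(shape)      # mu chosen so that E u^2 = K, cf. (13.83)
phi_white = np.full_like(w, K)          # white input with the same power
print(criterion(phi_opt), criterion(phi_white))
```

Both inputs use the full power budget, and the shaped spectrum gives the smaller criterion value, in line with the Schwarz inequality argument in the proof of Theorem 13.3.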


Other cases of the general problem (13.73) are treated in the Problem Section,
and in the references mentioned in the bibliography.


13.7 CHOICE OF SAMPLING INTERVAL AND PRESAMPLING FILTERS 


The procedure of sampling the data that are produced by the system is inherent in
computer-based data-acquisition systems. It is unavoidable that sampling as such
leads to information losses, and it is important to select the sampling instants so
that these losses are insignificant. In this section we shall assume that the sampling
is carried out with equidistant sampling instants, and we shall discuss the choice of
the sampling interval $T$.
Aliasing


The information loss incurred by sampling is best described in the frequency domain.
Suppose that a signal $s(t)$ is sampled with the sampling interval $T$:


$$s_k = s(kT), \quad k = 1, 2, \ldots$$


Denote by $\omega_s = 2\pi/T$ the sampling frequency; then $\omega_N = \omega_s/2$ is the Nyquist
frequency. Now, it is well known that a sinusoid with frequency higher than $\omega_N$
cannot, when sampled, be distinguished from one in the interval $[-\omega_N, \omega_N]$:
With $|\omega| > \omega_N$ there exists an $\bar\omega$, $-\omega_N \le \bar\omega \le \omega_N$, so that

$$\cos \omega kT = \cos \bar\omega kT, \qquad \sin \omega kT = \sin \bar\omega kT, \qquad k = 0, 1, \ldots \tag{13.84}$$


This follows from simple manipulations with trigonometric formulas. Consequently, 
the part of the signal spectrum that corresponds to frequencies higher than wy will be 
interpreted as contributions from lower frequencies. This is the alias phenomenon; 
the frequencies appear under assumed names. It also means that the spectrum of the 
sampled signal will be a superposition of different parts of the original spectrum: 


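$$\Phi_s^{(T)}(\omega) = \sum_{r=-\infty}^{\infty} \Phi_s^{c}(\omega + r\omega_s) \tag{13.85}$$


Here $\Phi_s^{c}$ is the spectrum of the continuous-time signal, defined by (13.36), and
$\Phi_s^{(T)}(\omega)$ is the spectrum of the sampled signal:


$$R_T(\ell T) = \bar E s_k s_{k+\ell} = \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} E s(kT) s(kT + \ell T)$$
$$\Phi_s^{(T)}(\omega) = T \sum_{\ell=-\infty}^{\infty} R_T(\ell T) e^{-i\omega \ell T} \tag{13.86}$$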


The effect of (13.85) is often called folding: the original spectrum is “folded” (and 
added) to give the sampled spectrum. 
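
The alias phenomenon in (13.84) is easy to check numerically. The following minimal sketch folds a frequency above the Nyquist frequency into the baseband and compares the sampled sinusoids; the function name and the numerical values of $T$ and $\omega$ are arbitrary choices for illustration.

```python
import math

def alias_frequency(omega, T):
    """Fold omega into the baseband [-omega_N, omega_N), with omega_N = pi / T."""
    omega_s = 2.0 * math.pi / T      # sampling frequency
    omega_n = omega_s / 2.0          # Nyquist frequency
    return (omega + omega_n) % omega_s - omega_n

T = 0.1                              # sampling interval (assumed value)
omega = 50.0                         # above the Nyquist frequency pi/T = 31.4...
omega_bar = alias_frequency(omega, T)

# The sampled cosines and sines of omega and omega_bar coincide at all instants.
max_diff = max(
    max(abs(math.cos(omega * k * T) - math.cos(omega_bar * k * T)),
        abs(math.sin(omega * k * T) - math.sin(omega_bar * k * T)))
    for k in range(50)
)
```

Here `omega` and `omega_bar` differ by exactly one sampling frequency, so `max_diff` is zero up to rounding errors.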


Antialiasing Presampling Filters 


The information about frequencies higher than the Nyquist one is thus lost by
sampling. It is then important not to make bad worse by letting the folding effect distort
the interesting part of the spectrum below the Nyquist frequency. This is achieved
by a presampling filter $\kappa(p)$:


$$s_F(t) = \kappa(p)s(t) \tag{13.87}$$

($p$ is here the differentiation operator). Analogous to the formulas of Theorem 2.2,
the spectrum of the filtered signal $s_F(t)$ will be


$$\Phi_{s_F}^{c}(\omega) = |\kappa(i\omega)|^2 \Phi_s^{c}(\omega) \tag{13.88}$$

Ideally, $\kappa(i\omega)$ should have a characteristic so that

$$|\kappa(i\omega)| = 1, \quad |\omega| \le \omega_N; \qquad |\kappa(i\omega)| = 0, \quad |\omega| > \omega_N \tag{13.89}$$

This can be realized only approximately. In the ideal case (13.89), we would have


$$\Phi_{s_F}^{c}(\omega) = 0, \quad |\omega| > \omega_N$$

which means that the sampled signal

$$s_k^F = s_F(kT)$$

will have a spectrum, according to (13.85),

$$\Phi_{s_F}^{(T)}(\omega) = \Phi_{s_F}^{c}(\omega), \quad -\omega_N \le \omega < \omega_N \tag{13.90}$$


With the filter (13.87) and (13.89) we thus achieve a sampled spectrum with no alias 
effects. Therefore, this filter is also called an antialiasing filter. According to what 
we said, such a filter should always be applied before sampling if we suspect that the 
signal has nonnegligible energy above the Nyquist frequency. 


Noise-reduction Effect of Antialiasing Filters


A typical situation is that the signal consists of a useful part and a disturbance part,
and that the spectrum of the disturbances is more broadband than that of the signal.
Then the sampling interval is usually chosen so that most of the spectrum of the useful
part is below $\omega_N$. The antialiasing filter then essentially cuts away the high-frequency
noise contributions. Suppose we have


$$s(t) = m(t) + v(t)$$


where $m(t)$ is the useful signal and $v(t)$ is the noise. Let $\Phi_v^{c}(\omega)$ be the spectrum of
$v(t)$. The sampled, prefiltered signal then is


$$s_k^F = m_k^F + v_k^F, \qquad s_k^F = s_F(kT)$$


where the variance of the noise is


$$E(v_k^F)^2 = \int_{-\omega_N}^{\omega_N} \Phi_{v_F}^{(T)}(\omega)\, d\omega = \sum_{r=-\infty}^{\infty} \int_{-\omega_N}^{\omega_N} \Phi_{v_F}^{c}(\omega + r\omega_s)\, d\omega$$


From this expression we see how the noise effects from higher frequencies are folded
into the region $[-\omega_N, \omega_N]$ and are thus contributing to the noise power. By eliminating
the high-frequency noise by an antialiasing filter (13.89), the variance of $v_k^F$
is thus reduced by the term


$$\sum_{r \ne 0} \int_{-\omega_N}^{\omega_N} \Phi_v^{c}(\omega + r\omega_s)\, d\omega = \int_{|\omega| > \omega_N} \Phi_v^{c}(\omega)\, d\omega$$

compared to no presampling filter at all. This is a significant noise reduction if the
noise spectrum has considerable energy above the Nyquist frequency.
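
To get a feeling for the size of this reduction, one may assume a specific noise spectrum. The sketch below uses a hypothetical first-order spectrum $1/(1 + (\omega/\omega_c)^2)$ with corner frequency $\omega_c$ (not taken from the text), for which the fraction of the noise power lying above the Nyquist frequency has the closed form $1 - (2/\pi)\arctan(\omega_N/\omega_c)$. This is the fraction an ideal antialiasing filter would remove.

```python
import math

def fraction_removed(omega_n, omega_c):
    """Share of noise power above omega_n for the assumed spectrum
    1 / (1 + (omega / omega_c)**2), i.e. the part an ideal filter removes."""
    return 1.0 - (2.0 / math.pi) * math.atan(omega_n / omega_c)

# Nyquist frequency equal to, and ten times below, the noise corner frequency.
equal_case = fraction_removed(omega_n=1.0, omega_c=1.0)
broadband_case = fraction_removed(omega_n=1.0, omega_c=10.0)
```

For this assumed spectrum, the more broadband the noise is relative to the Nyquist frequency, the larger the share of the noise power that the filter eliminates.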


Antialiasing Filters during Data Acquisition 


Let us comment on the role of the antialiasing filters in system identification
applications. Suppose first that the system is not under sampled-data control so that
the continuous-time input is not piecewise constant. This may be the case when we
collect data from a process in normal operation. If the input then is band limited
and has no energy above the frequency $\omega_B$, this means that all useful information in
the output also lies below $\omega_B$, provided the process is linear. We could then apply
an antialiasing filter with cutoff frequency $\omega_B$ and sample with $T = \pi/\omega_B$ with no
loss of information. If the input is not band limited, the antialiasing filter will destroy
useful information at the same time as the noise is reduced. If $T$ is chosen so that
the Nyquist frequency (= the cutoff frequency for the filter) is above the bandwidth
of the system, the loss of useful information is insignificant. Notice that in this case
the antialiasing presampling filter should be applied also to the input signal.

Consider now the case that the input is piecewise constant over the sampling 
interval. Then, clearly, the sampled input equals the piecewise constant values, and
no presampling filtering should be applied to this sequence. The stepwise changes in 
the process input do, though, contain high frequencies that could travel through the 
process to the output. An antialiasing filter applied to the process output could thus 
distort useful information. There are three ways to handle this problem: 


1. Sample fast enough that the process is well damped above the Nyquist fre- 
quency. Then the high-frequency components in the output that originate from 
the input are insignificant. 


2. Consider the antialiasing output filter as part of the process and model the 
system from input to filtered output (this might increase the necessary model 
orders, though). 


3. Since the antialiasing filter is known, include it as a known part of the model, 
and let the predicted output pass through the filter before being used in the 
identification criterion [this approach is illustrated in (13.95) and (13.96)]. 


Solution 1 is the most natural; it is conceptually depicted in Figure 13.8. 


Remark: For control purposes it might be a good idea to apply a low-pass filter 
to the piecewise constant sampled-data input sequence. This will also be helpful for 
solution 1. 
Figure 13.8 Sampling depicted in the frequency domain. Solid line: frequency
characteristics of the process; dashed line: noise spectrum; dotted line: frequency
characteristics of the antialiasing filter. N: the Nyquist frequency. B: bandwidth.


Some General Aspects on the Choice of T and N 


If the total experiment time $0 \le t \le T_N$ is limited, but the acquisition of data within
this time is costless, it is clearly advantageous from an information theoretic point of
view to sample as fast as possible. Slower sampling leads to data sets that are subsets
of the maximal one, and hence are less informative. The cost effectiveness of the
new information will, however, typically decrease as we sample faster and faster (cf.
Figure 13.9). In this idealized case, where adding new data points is costless, there are
only two aspects that may prevent us from sampling as fast as technically possible:
One is that building sampled models with very small sampling interval compared
to the natural time constants is a numerically sensitive procedure (all poles cluster
around the point 1). See Problem 13G.1. The other is that the model fit may be
concentrated to the high-frequency band (see the following discussion on bias). The
latter problem should be dealt with by prefiltering the data so as to redistribute bias,
as explained in Section 14.4. The former problem should probably be handled by
fitting continuous-time models directly to the fast sampled data with models of the
type (2.23).

Another idealized situation is that the experiment time $0 \le t \le T_N$ as such
is costless, and all cost is associated with acquiring and handling the data. We could
then settle for collecting, say, $N$ data and select $T$ (then $T_N = N \cdot T$) so that the data
set becomes as informative as possible. A $T$ that is much larger than the interesting
time constants of the system would then yield data with little information about the
dynamics. A small $T$, on the other hand, would not allow for much noise reduction,
and the data might be less informative for that reason. A good choice of $T$ should
thus be a trade-off between noise reduction and relevance for the dynamics.


If the model should be used for control purposes, certain other aspects will
enter. The sampling interval for which we build the model should be the same as for
the control application (unless we want to recalculate it from one sampling interval
to another). A fast sampled model will often be nonminimum phase (Åström and
Wittenmark, 1984), and a system with dead time may be modeled with delay of many
sampling periods. Such effects may cause problems for the control design and will
therefore influence the choice of $T$.

For the choice of $N$, it is useful to have the asymptotic result (9.92) in mind. It
describes how small the model order to sample size ratio has to be in order to achieve
a certain accuracy for a given design and given noise spectrum.


Bias Considerations 


We know from Chapter 8 that the fit between the model transfer function and the
true one is related to a quadratic norm [see (8.71)]:


$$\theta^* = \arg\min_{\theta} \int_{-\pi/T}^{\pi/T} \left| G_0(e^{i\omega T}) - G(e^{i\omega T}, \theta) \right|^2 Q(\omega, \theta)\, d\omega \tag{13.91}$$

Here $Q(\omega, \theta)$ is the filtered input spectrum divided by the noise spectrum:


$$Q(\omega, \theta) = \frac{\Phi_u(\omega)}{|H(e^{i\omega T}, \theta)|^2}$$


We also marked where the $T$-dependence enters. As $T$ tends to zero, the frequency
range over which the fit is made in (13.91) increases. Normally, though, the natural
dynamics of the system and the model are such that $G_0(e^{i\omega T}) - G(e^{i\omega T}, \theta)$ is well
damped for high frequencies so that the contribution from higher values of $\omega$ in
(13.91) will be insignificant even if the input is broadband. An important exception
is the case where the noise model is coupled to the dynamics, as in the ARX structure
(4.9), where $H(e^{i\omega T}) = 1/A(e^{i\omega T})$. Then the product


$$\frac{\left| G_0(e^{i\omega T}) - G(e^{i\omega T}, \theta) \right|^2}{|H(e^{i\omega T}, \theta)|^2}$$


does not tend to zero as $\omega$ increases and the fit in (13.91) is pushed into very high
frequency bands as $T$ decreases. This may lead to quite curious results, as illustrated
in Wahlberg and Ljung (1986). In such cases, very fast sampling is thus undesirable,
even apart from the numerical difficulties that may arise. The effects can be
counteracted by proper prefiltering, as described in Section 14.4. In any case, it is important
to keep in mind the influence of $T$ on the bias distribution.


Variance Considerations 


The variance of an estimated parameter based on a given number of data will depend 
on the average information per sample. This, as mentioned previously, is a trade-off
between the noise reduction that slow sampling may offer and the poor information 
about the dynamics that slowly sampled data contain. To pinpoint this trade-off, let 
us consider a simple example. 
Example 13.4 Optimal Sampling


Consider a continuous-time system


$$y(t) = \frac{1}{1 + p\tau} u(t) + v(t)$$

or

$$\tau \dot x(t) + x(t) = u(t), \qquad y(t) = x(t) + v(t) \tag{13.92}$$


where $v(t)$ is a very broadband noise (= “almost white noise”) with variance $\lambda/T_0$,
where $1/T_0$ is its bandwidth [i.e., $T_0$ is the smallest sampling interval under which the
sampled version of $v(t)$ is truly white]. As a simple presampling filter, we employ
an integrator


$$y_T(kT) = \frac{1}{T} \int_{(k-1)T}^{kT} y(t)\, dt = \bar x(kT) + v_T(kT) \tag{13.93}$$

Here $\bar x(kT)$ is the mean of the useful signal $x(t)$ over the sampling interval and
$\{v_T(kT)\}$ is a sequence of independent random variables with variance $\lambda/T$ (if
$T > T_0$). We use an output error model set


$$x(kT + T) = e^{-aT} x(kT) + (1 - e^{-aT}) u(kT) \tag{13.94a}$$
$$\hat y(kT + T \mid kT, a) = x(kT + T) = \frac{q(1 - e^{-aT})}{q - e^{-aT}} u(kT) \tag{13.94b}$$

Here, the model parameter $a$ corresponds to $1/\tau$ with $\tau$ as in (13.92). We let the
input signal be a sinusoid (piecewise constant) of frequency $\omega_0$:

$$u(kT) = \sqrt{2} \cos(\omega_0 kT)$$


When calculating the predictor (13.94), we ignored the presampling filter, which may
be reasonable when $T$ is small enough. To allow for a fair treatment also of larger
values of $T$, we could take the presampling filter (13.93) into account and let the
prediction be


$$\frac{d}{dt} x(t, a) = -a x(t, a) + a u(t) \tag{13.95a}$$
$$\hat y_T(kT + T \mid kT, a) = \frac{1}{T} \int_{t=kT}^{(k+1)T} x(t, a)\, dt = \frac{\rho q - e^{-aT}\left[1 - (1/aT)(e^{aT} - 1)\right]}{q - e^{-aT}}\, u(kT) \tag{13.95b}$$
$$\rho = 1 - \frac{1}{aT}\left(1 - e^{-aT}\right) \tag{13.96}$$
(Notice that the second term in the numerator is $\approx aT/2$ for small $T$.) The asymptotic
variance of $\hat a_N$ is now given by Theorem 9.1 as


$$E(\hat a_N - 1/\tau)^2 \sim \frac{1}{N}\, \frac{\operatorname{Var} v_T(t)}{\bar E(\psi_T(t))^2} \tag{13.97}$$

where

$$\operatorname{Var} v_T(t) = \frac{\lambda}{T} \quad \text{and} \quad \psi_T(t) = \frac{d}{da} \hat y_T(kT + T \mid kT, a)\Big|_{a = 1/\tau}$$

For the simplified expression (13.94), we have

$$\psi_T(t) = \frac{T e^{-T/\tau}(q - 1)}{(q - e^{-T/\tau})^2}\, \sqrt{2} \cos \omega_0 t$$

and

$$\bar E \psi_T^2(t) = \frac{T^2 e^{-2T/\tau}(2 - 2\cos \omega_0 T)}{\left[1 - 2e^{-T/\tau} \cos \omega_0 T + e^{-2T/\tau}\right]^2} \tag{13.98}$$

We thus have

$$\operatorname{Var} \hat a_N \sim \frac{\lambda}{N T \cdot \bar E \psi_T^2(t)} \tag{13.99}$$


This expression tends to infinity as

$$\frac{e^{2T/\tau}}{T^3}$$

as $T$ increases to infinity; this is the effect of poor information about $\tau$ with slow
sampling. Also, some calculations reveal that it tends to infinity as $1/T$ when $T$
tends to zero; this is the effect of poor noise rejection at fast sampling. We have thus
formalized the earlier mentioned trade-off.


The exact predictor (13.96) gives similar, but more complicated expressions. In
Figure 13.9 the expression (13.99) and the exact counterpart are plotted as functions
of $T$ for $\omega_0 = 1/\tau$. The figure reveals two things:


1. The optimal choice of sampling interval lies around the time constant of the 
system. 


2. It is far worse to use a too large $T$ than a too small one: $T = 10\tau$ gives a
variance more than 10° times the optimal one, while $T = 0.1\tau$ gives a variance
that is less than 10 times the optimal one.
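
The trade-off in Example 13.4 can also be explored numerically. The sketch below evaluates the simplified variance expression (13.99), with the regressor power written as $T^2 e^{-2T/\tau}(2 - 2\cos\omega_0 T)/[1 - 2e^{-T/\tau}\cos\omega_0 T + e^{-2T/\tau}]^2$, on a grid of sampling intervals. The normalizations $\tau = 1$, $\omega_0 = 1/\tau$, $\lambda = N = 1$, the grid, and the function names are assumptions made for the illustration. It covers only the simplified predictor (13.94), not the exact one.

```python
import math

def e_psi_sq(T, tau, omega0):
    """Power of the sensitivity signal psi_T for the simplified predictor."""
    e = math.exp(-T / tau)
    num = T ** 2 * e ** 2 * (2.0 - 2.0 * math.cos(omega0 * T))
    den = (1.0 - 2.0 * e * math.cos(omega0 * T) + e ** 2) ** 2
    return num / den

def var_a(T, tau=1.0, omega0=1.0, lam=1.0, N=1):
    """Asymptotic variance of the estimate of a = 1/tau, simplified case."""
    return lam / (N * T * e_psi_sq(T, tau, omega0))

grid = [0.05 * i for i in range(1, 101)]     # T from 0.05*tau to 5*tau
T_opt = min(grid, key=var_a)                 # grid point with smallest variance
ratio_slow = var_a(10.0) / var_a(T_opt)      # penalty for T = 10*tau
ratio_fast = var_a(0.1) / var_a(T_opt)       # penalty for T = 0.1*tau
```

Under these assumptions the minimizing grid point is a small multiple of the time constant, and the penalty for sampling ten times too slowly is several orders of magnitude larger than the penalty for sampling ten times too fast.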
Figure 13.9 Variance of $\hat a_N$ plotted as a function of the sampling interval
$T$ ($\omega_0 = 1/\tau$). (1) Expression (13.99). (2) Using (13.96).


Conclusions 


Let us summarize the discussion on sampling rates as follows. 


• Very fast sampling leads to numerical problems, model fits in high-frequency
bands, and poor returns for extra work.


• As the sampling interval increases over the natural time constants of the system,
the variance increases drastically.


• Optimal choices of $T$ for a fixed number of samples will lie in the range of
the time constants of the system. These are, however, not exactly known, and
overestimating them may lead to very bad results.


All these aspects point to the advice that a sampling frequency that is about
ten times the bandwidth of the system should be a good choice in most cases. Note
that this discussion concerns the sampling rate chosen for the model building. With
“cheap” data acquisition we can always sample as fast as possible during the experiment
and leave the actual choice of $T$ for later by digitally prefiltering and decimating
the original data record.
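
A very simple way to carry out such digital prefiltering and decimation is to average consecutive blocks of the fast-sampled record, which is a discrete-time analogue of the integrating presampling filter (13.93). The sketch below is one possible realization under that assumption, not a prescribed procedure; the function name, the block length, and the test signal are hypothetical. Sharper low-pass filters may be preferable in practice.

```python
import math

def average_and_decimate(x, R):
    """Average consecutive blocks of R samples and keep one value per block.

    The block average acts as a crude low-pass prefilter; the output is
    sampled R times slower than the input. Trailing samples that do not
    fill a complete block are discarded.
    """
    n_blocks = len(x) // R
    return [sum(x[k * R:(k + 1) * R]) / R for k in range(n_blocks)]

# Slow sinusoid plus a disturbance alternating at the fast Nyquist frequency.
fast = [math.sin(0.01 * k) + 0.5 * (-1) ** k for k in range(1000)]
slow = average_and_decimate(fast, R=10)
```

In this example the alternating disturbance averages out within each block of even length, so the decimated record essentially retains only the slow component.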


13.8 SUMMARY 


Careful experiment design, yielding data with good information is the basis of a 
successful identification application. In this chapter we have discussed how to design 
good experiments. The leading principles are as follows: 


• Let the experimental condition resemble the situation for which the model is
going to be used.


• Identifiability is secured by persistently exciting input and not allowing too
simple feedback mechanisms (Theorem 13.2).

• To minimize the parameter variance with respect to the experiment design
variables, the expression


$$P_\theta(X) \sim \frac{\lambda_0}{N} \left[ \bar E \left( \frac{d}{d\theta} \hat y(t|\theta) \right) \left( \frac{d}{d\theta} \hat y(t|\theta) \right)^T \right]^{-1}$$


shows that interesting parameters must have a clear effect on the output
predictions.


• Periodic inputs have certain advantages, in particular, for single input systems.
Sum of sinusoids and PRBS signals may then be good choices. To make use of
the advantages, an integer number of periods should be applied.


• For systems operating in closed loop, the basic choice of method is to apply a
prediction error method in a direct fashion, using a flexible noise model.


• To minimize a quadratic frequency-domain fit of the estimated transfer function
$\hat G_N(e^{i\omega})$ in a (high order) linear model set, use, in open loop,


$$\Phi_u(\omega) \sim \sqrt{\Phi_v(\omega) \cdot C_{11}(\omega)}$$


where $C_{11}$ is the weighting function in the criterion of fit (Theorem 13.3).


• A suitable choice of sampling frequency lies in the range of ten times the guessed
bandwidth of the system. In practice, it is useful to first record a step response
from the system, and then select the sampling interval so that it gives 4-6
samples during the rise time.
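
The two rules of thumb in the last item are simple enough to write down as helper functions. The sketch below is only an illustration: the function names are hypothetical, the factor ten is taken from the advice above, and the number of samples per rise time is left as a parameter for the user to choose.

```python
import math

def interval_from_bandwidth(omega_b, factor=10.0):
    """Sampling interval T such that omega_s = 2*pi/T equals factor * omega_b."""
    return 2.0 * math.pi / (factor * omega_b)

def interval_from_rise_time(t_rise, samples_per_rise):
    """Sampling interval that places the requested number of samples
    within the rise time read off from a recorded step response."""
    return t_rise / samples_per_rise

T_bw = interval_from_bandwidth(omega_b=1.0)        # guessed bandwidth 1 rad/s
T_rise = interval_from_rise_time(t_rise=2.0, samples_per_rise=4)
```

Both helpers return a sampling interval in the same time unit as their inputs, so the two rules can be compared directly for a given process.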
Section 13.4: A survey of identifiability in closed-loop experiments is given in
Van den Hof and Schrama (1993), Schrama (1991), and the survey Forssell and Ljung
(1998a). The latter also describes the use of a non-causal FIR filter in (13.70). The
application of direct output error methods to identify unstable systems is discussed
in Forssell and Ljung (1998c). For the subspace methods of Section 10.6, several
algorithms have been specifically devised to handle closed loop data: The algorithm
in Chou and Verhaegen (1997) can be described as a direct approach, while Van
Overschee and DeMoor (1997) and Verhaegen (1993) can be described as indirect
and joint input/output, respectively.


Section 13.6: The section is based on Forssell and Ljung (1998b), Ljung (1985b),
and Ljung (1986). Open-loop experiments are treated in Yuan and Ljung (1985) and
closed-loop ones in Gevers and Ljung (1986).


Section 13.7: For alias effects and anti-alias filtering see, e.g., Oppenheim and
Willsky (1983) or Åström and Wittenmark (1984). A study of the choice of sampling
interval similar to our Example 13.4 is given in Åström (1969). A frequency-domain-
based study is given in Payne, Goodwin, and Zarrop (1975). The bias considerations
are further discussed in Wahlberg and Ljung (1986).


13.10 PROBLEMS 


13G.1 Effects of round-off errors: Suppose that the data are measured with a precision of
10 bits (which is quite reasonable for typical A/D converters) and suppose that the
bandwidth of the system (the highest frequency of interest) is $\omega_B$. Follow the advice
of Section 13.7 and choose the sampling frequency $\omega_s = 10\omega_B$. What is the lowest
frequency that can be adequately modeled with the finite precision data? (Hint: A
natural mode with time constant $\tau$ behaves, approximately, as


$$y(t + T) = \left(1 - \frac{T}{\tau}\right) y(t) + \ldots u(t)$$


if $T \ll \tau$, where $T$ is the sampling interval $T = 2\pi/\omega_s$.) Answer: Lowest frequency
$\approx 0.01\omega_B$. Notice the implications on how wide (i.e., narrow) are the frequency ranges
that can be adequately modeled! Reference: Goodwin (1985).

13G.2 Singular criterion matrix: Suppose that the criterion matrix $C(\omega)$ in (12.6) and
(13.73) is singular, and rewrite it


|M =" 


C(w) = so | M(e-i”) 


13G.3 


13G.4 
Show that, in (13.73b),


- @, oM io) — Pre i 
tonie e l pare r= Do l 


Ao 2a, (w) = (Dueto 


Conclude from this that the criterion 


aT 
min | P(w. X) dw (13.100) 


T 


subject to any constraint is minimized by the closed-loop design 


ult) = s +r (13.101) 
Gog) M (q) + Hotay 
for any extra input $r(t)$ [including $r(t) \equiv 0$], provided it is an admissible $\{u(t)\}$. That
is, (13.101) gives the global minimum of (13.100) w.r.t. $X$ in (13.73) (reference: Ljung,
1986).
13G.3 Youla-Parameterization is a way of parameterizing all stabilizing regulators for a
given system, or equivalently parameterizing all systems that are stabilized by a
given regulator $F_y$. (See, e.g., Vidyasagar, 1985.) In the SISO case it works as follows.
Let $F_y = X/Y$ ($X$, $Y$ stable, coprime) and let $G_{nom} = N/D$ ($N$, $D$ stable, coprime)
be any system that is stabilized by $F_y$. Then, as $R$ ranges over all stable transfer
functions, the set

$$\mathcal{G} = \left\{ G(q, \theta) \;\Big|\; G(q, \theta) = \frac{N(q) + Y(q)R(q, \theta)}{D(q) - X(q)R(q, \theta)} \right\}$$


describes all systems that are stabilized by $F_y$. This idea can now be used for
identification (see, e.g., Hansen, Franklin, and Kosut, 1989, Van den Hof and Schrama, 1991):
Let $u$ and $y$ be input-output data from a plant controlled by the regulator $F_y$. Define
$X$, $Y$, $N$, $D$ as above and let


$$z(t) = y(t) - L(q)N(q)Y(q)r(t)$$
$$x(t) = L(q)Y^2(q)r(t)$$


where $L = 1/(YD + NX)$, which is stable and inversely stable (since $G_{nom}$ is stabilized
by $F_y$). Then estimate $R(q, \theta)$ from the open loop identification problem


$$z(t) = R(q, \theta)x(t) + H(q, \theta)e(t) \tag{13.102}$$


Use the resulting estimate $\hat R$ in $\mathcal{G}$ to find the corresponding estimate $\hat G$ of the transfer
function from $u$ to $y$. Show that this method is the same as applying indirect
identification for the model parameterization $\mathcal{G}$. The main advantage of this method is that
the obtained estimate $\hat G$ is guaranteed to be stabilized by $F_y$.


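13G.4 Consider the joint input-output model (13.65) with the parameterization


$$\mathcal{G}(q, \theta) = \frac{1}{1 + G(q, \theta)F_y(q, \theta)} \begin{bmatrix} G(q, \theta) \\ 1 \end{bmatrix}$$


$$\mathcal{H}(q, \theta) = \frac{1}{1 + G(q, \theta)F_y(q, \theta)} \begin{bmatrix} H(q, \theta) & G(q, \theta) \\ -F_y(q, \theta)H(q, \theta) & 1 \end{bmatrix}$$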
13E.1


13E.2 


13E.3 


13E.4 


13E.5 


13E.6 


This is the structure we obtain if we think in terms of the regulator (13.63) and take
$v = \begin{bmatrix} e \\ w \end{bmatrix}$. Assume that the covariance matrix of $v$ is $\begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}$. Apply the
multivariable prediction error/maximum likelihood method to this parameterization
and show that the criterion becomes


$$V_N(Z^N, \theta) = \frac{1}{\lambda_1} \sum_{t=1}^{N} \left| H^{-1}(q, \theta)\left(y(t) - G(q, \theta)u(t)\right) \right|^2 + \frac{1}{\lambda_2} \sum_{t=1}^{N} \left[ u(t) - r(t) + F_y(q, \theta)y(t) \right]^2$$


If $G$ and $F_y$ are independently parameterized, this is the same as estimating $G$ by the
direct method and separately determining $F_y$ from (13.63).


13E.1 Suppose that the signal $u(t)$ is persistently exciting of order $n$. Give a condition on the
stable filter $L(q)$ that guarantees that $u_F(t) = L(q)u(t)$ is also persistently exciting
of order $n$.


13E.2 Consider the ARX structure

$$A(q)y(t) = B(q)u(t) + e(t)$$


where the degree of $B$ is $n_b$. Show that an open-loop input that is persistently exciting
of order $n_b$ is sufficiently informative with respect to this set, regardless of the order
of $A$, provided the process noise is persistently exciting.


13E.3 Consider a model structure

$$y(t) = G(q, \rho)u(t) + H(q, \eta)e(t)$$

with independent parametrization of $G$ and $H$. Show that it is not possible to affect
the accuracy of $\hat\eta$ in an open-loop experiment.
13E.4 Consider the FIR model structure


$$y(t) = b_1 u(t - 1) + \ldots + b_m u(t - m) + e(t)$$

Determine the input spectrum that minimizes $\det P_\theta(X)$ subject to the constraint
$Eu^2(t) \le 1$.
13E.5 Consider the model structure


$$y(t) + a y(t - 1) = b u(t - 1) + e(t)$$


and determine the open-loop input spectrum that minimizes $\det P_\theta(X)$ subject to
$Eu^2(t) \le 1$. What is the optimal input in case $b$ is fixed to the value 1? Assume that
the true system is given by


$$y(t) - 0.5y(t - 1) = u(t - 1) + e_0(t)$$


where $e_0(t)$ is white noise.

13E.6 Suppose we are interested in the dynamics from propeller velocity to speed of a ship.
We may measure both these signals but may only affect the torque of the propeller 
axis from the engine. This axis is also affected by forces from the water resistance that 
depend on the ship's speed in a complex manner. Discuss the identifiability of the loop 
from propeller velocity to ship speed based on such experiments. 


13E.7 


13T.1 


13T.2 


13D.1 
13E.7 As mentioned in the text, it is not straightforward to use PRBS for multi-input systems.
One idea for a two-input system is as follows: Let $s^M = s(t)$, $t = 1, \ldots, M$ be a
maximum length ($M$) PRBS signal. Make an identification experiment over a multiple
of $2M$ samples by letting $u_1 = u_2 = s^M$ for the first $M$ samples, and $u_1 = -u_2 = s^M$
for the next $M$ samples. Show that this gives the same input covariance matrix as
exciting one input at a time with $\sqrt{2}s^M$, i.e., $u_1 = \sqrt{2}s^M$, $u_2 = 0$ for the first $M$
samples and then $u_1 = 0$, $u_2 = \sqrt{2}s^M$ for the next $M$ samples.

13T.1 Minimize $J_P(X)$ in (13.73) with respect to $\Phi_u$ with $\Phi_{ue}(\omega) \equiv 0$, subject to the
constraint $Ey^2(t) \le \alpha$. Assume that the true system is given by

$$y(t) = G_0(q)u(t) + H_0(q)e_0(t)$$

13T.2 Experiment design for minimum variance control: Suppose that the intended model
application is minimum variance control. Use Problem 13G.2 to compare the performance
degradation $J_P(X)$ obtained with an optimal open-loop input to that obtained
for the overall optimal input (reference: Gevers and Ljung, 1986).


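13D.1 Consider the time-varying system

$$y(t) = G_t(q)u(t), \quad \text{all } t$$

such that

$$G_t(q) = G_1(q), \quad kN \le t < kN + \alpha N$$
$$G_t(q) = G_2(q), \quad kN + \alpha N \le t < (k + 1)N$$

Let the spectrum of $u$ be $\Phi_u(\omega)$. Show that, as $N$ increases, the spectrum of $y$ becomes


$$\Phi_y(\omega) = \alpha G_1(e^{i\omega})\Phi_u(\omega)G_1^T(e^{-i\omega}) + (1 - \alpha)G_2(e^{i\omega})\Phi_u(\omega)G_2^T(e^{-i\omega})$$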
PREPROCESSING DATA


When the data have been collected from the identification experiment, they are not 
likely to be in shape for immediate use in identification algorithms. There are several 
possible deficiencies in the data that should be attended to: 


1. High-frequency disturbances in the data record, above the frequencies of in- 
terest to the system dynamics 


2. Occasional bursts and outliers, missing data, non-continuous data records 
3. Drift and offset, low-frequency disturbances, possibly of periodic character 


It must be stressed that in off-line applications, one should always first plot the data 
in order to inspect them for these deficiencies. In this section we shall discuss how
to preprocess the data so as to avoid problems in the identification procedures later. 


14.1 DRIFTS AND DETRENDING 
Low-frequency disturbances, offsets, trends, drift, and periodic (seasonal) variations
are not uncommon in data. They typically stem from external sources that we may 
or may not prefer to include in the modeling. There are basically two different 
approaches to dealing with such problems: 


1. Removing the disturbances by explicit pretreatment of the data 
2. Letting the noise model take care of the disturbances 


The first approach involves removing trends and offsets by direct subtraction,
while the second relies on noise models with poles on or close to the unit circle,
like the ARIMA models (I for integration) much used in the Box and Jenkins
(1970) approach.
Signal Offsets


We shall illustrate the two approaches applied to the offset problem. The standard
linear models that we use, like


$$A(q)y(t) = B(q)u(t) + v(t) \tag{14.1}$$


describe the relationship between $u$ and $y$. That covers the dynamic properties, i.e.,
how a change in $u$ causes changes in $y$, as well as the static properties, i.e., the
static relationship between a constant $u(t) \equiv \bar u$ and the resulting steady-state value
of $y(t)$, say $\bar y$:


$$A(1)\bar y = B(1)\bar u \tag{14.2}$$


In practice, the raw input-output measurements, say $u^m(t)$, $y^m(t)$, are collected and
recorded in physical units, the levels of which may be quite arbitrary. The equation
(14.1) that describes the dynamic properties may therefore have very little to do
with the equation (14.2) that relates the levels of the signals. In other words, (14.2)
is quite an unnecessary constraint for (14.1). There are at least six ways to deal with
this problem:


1. Let $y(t)$ and $u(t)$ be deviations from a physical equilibrium: The most natural
approach is to determine the level $\bar y$ that corresponds to a constant $u^m(t) \equiv \bar u$
close to the desired operating point. Then define


$$y(t) = y^m(t) - \bar y \tag{14.3a}$$
$$u(t) = u^m(t) - \bar u \tag{14.3b}$$

as the deviations from this equilibrium. These translated variables will automatically
satisfy (14.2), making both members equal to zero, and (14.2) will
thus not influence the fit in (14.1). This approach emphasizes the physical
interpretation of (14.1) as a linearization around the equilibrium.


2. Subtract sample means: A sound approach is to define


$$\bar y = \frac{1}{N} \sum_{t=1}^{N} y^m(t), \qquad \bar u = \frac{1}{N} \sum_{t=1}^{N} u^m(t) \tag{14.4}$$


and then use (14.3). If an input $u^m(t)$ that varies around $\bar u$ leads to an output
that varies around $\bar y$, then $(\bar u, \bar y)$ is likely to be close to an equilibrium point
of the system. Approach 2 is thus closely related to the first approach.


3. Estimate the offset explicitly: One could also model the system using variables
in the original physical units and add a constant that takes care of the offsets:


$$A(q)y^m(t) = B(q)u^m(t) + \alpha + v(t) \tag{14.5}$$

Comparing with (14.1) to (14.3), we see that $\alpha$ corresponds to $A(1)\bar y - B(1)\bar u$.
The value $\alpha$ is then included in the parameter vector $\theta$ and estimated from
data. It turns out that this approach in fact is a slight variant of the second
approach. See Problem 14E.1.


4. Using a noise model with integration (= differencing the data): In (14.5) the
constant $\alpha$ could be viewed as a constant disturbance, which is modeled as


$$\frac{\alpha}{1 - q^{-1}}\, \delta(t) \tag{14.6}$$


where $\delta(t)$ is the unit pulse at time zero. The model then reads


$$y^m(t) = \frac{B(q)}{A(q)} u^m(t) + \frac{1}{(1 - q^{-1})A(q)} w(t) \tag{14.7}$$

where $w(t)$ is the combined noise source $\alpha\delta(t) + v(t) - v(t - 1)$. The offset
$\alpha$ can thus be described by changing the noise model from $1/A(q)$ to
$1/[(1 - q^{-1})A(q)]$. According to what we noted in (7.14) this is equivalent to
prefiltering the data through the filter $L(q) = 1 - q^{-1}$, that is, differencing the
data:


$$y_F^m(t) = L(q)y^m(t) = y^m(t) - y^m(t - 1)$$
$$u_F^m(t) = L(q)u^m(t) = u^m(t) - u^m(t - 1) \tag{14.8}$$

5. Extending the noise model: Notice that the model (14.7) becomes a special
case of (14.1) if the orders of the $A$ and $B$ polynomials in (14.1) are increased
by 1. Then a common factor $1 - q^{-1}$ can be included in $A(q)$ and $B(q)$. This
means that a higher-order model, when applied to the raw data $y^m$, $u^m$, will
converge to a model like (14.7).


6. High pass filtering: Differencing data is a rather drastic filter for removing
a static component. Any high-pass filter that has gain (close to) zero at
frequency zero will have the same effect. See Section 14.4 for further discussion
of prefiltering.
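
Approaches 2 and 4 are simple data operations and can be sketched in a few lines. The code below is a minimal illustration with hypothetical function names and made-up data: `remove_means` subtracts the sample means as in (14.4) and (14.3), and `difference` applies the filter $1 - q^{-1}$ as in (14.8), which shortens the record by one sample.

```python
def remove_means(u_m, y_m):
    """Approach 2: subtract the sample means from the raw input and output."""
    u_bar = sum(u_m) / len(u_m)
    y_bar = sum(y_m) / len(y_m)
    return [u - u_bar for u in u_m], [y - y_bar for y in y_m]

def difference(x):
    """Approach 4: prefilter with L(q) = 1 - q^{-1}, i.e. x(t) - x(t-1)."""
    return [x[t] - x[t - 1] for t in range(1, len(x))]

# Made-up raw data in physical units with arbitrary levels.
u_raw = [2.0, 4.0, 6.0]
y_raw = [10.0, 11.0, 12.0]
u, y = remove_means(u_raw, y_raw)
y_diff = difference(y_raw)
```

Both operations remove a constant level. As discussed in the evaluation that follows, they differ in how they weight the remaining frequency content.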


Evaluation of the Approaches 


In an off-line application with offsets, the approach to recommend would be the first
one or, if a steady-state experiment is not feasible, the second one. Estimating the
offset explicitly (approach 3) is an unnecessarily complicated way to subtract the
sample mean. Differencing the data as in (14.8) corresponds to a prefilter (inverse
noise model) that has a very high gain at high frequencies. According to (8.71), this
will push the model fit into a high-frequency region, which is unsuitable for many
applications. Approach 5 has the additional drawback that more parameters have to
be estimated. Approach 6 may be a quite useful alternative, especially if the offset
is slowly time-varying.

It is especially important to remove offsets (trends and drifts) when output error
models are employed. The discrepancy in levels will then dominate the criterion
of fit, and the dynamic properties become secondary. For methods that use flexible
noise models (such as the least-squares method), the problem is less pronounced,
since the effects of approach 5 will automatically de-emphasize the importance of
signal levels.
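
To make the comparison between differencing and a milder high-pass filter concrete,
the following minimal sketch evaluates the gain of $L(q) = 1 - q^{-1}$ next to that of
a first-order high-pass filter. The second filter, its pole location `c = 0.95`, and
the function names are illustrative assumptions, not choices made in the text.

```python
import numpy as np

def differencing_gain(omega):
    """Magnitude of L(e^{i*omega}) for the differencing filter L(q) = 1 - q^{-1}."""
    return np.abs(1.0 - np.exp(-1j * omega))

def highpass_gain(omega, c=0.95):
    """Magnitude of the illustrative (assumed) first-order high-pass filter
    L(q) = 0.5*(1 + c)*(1 - q^{-1}) / (1 - c*q^{-1}); c is a hypothetical pole."""
    z_inv = np.exp(-1j * omega)
    return np.abs(0.5 * (1.0 + c) * (1.0 - z_inv) / (1.0 - c * z_inv))

omega = np.array([0.0, 0.5, np.pi])      # rad/sample
g_diff = differencing_gain(omega)        # rises steadily from 0 up to 2 at omega = pi
g_hp = highpass_gain(omega)              # zero at omega = 0, close to 1 elsewhere
```

Both filters remove a constant level, but the differencing gain keeps growing with
frequency, whereas the high-pass gain stays close to one above its cut-off.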


Drift, Trends, Seasonal Variations 


Methods to cope with other slow disturbances in the data are quite analogous to the
approaches we discussed previously. Drifts and trends can be seen as time-varying
equilibria. Straight lines or curve segments can be fitted to the data in the same
manner as the constant offset levels in (14.4), and deviations from these time-varying
means are considered. For seasonal variations, several techniques of this character
have been developed for economic time series. Periodic signals are adjusted to data,
and then subtracted. See, e.g., Box and Jenkins (1970).

Another approach would be to difference the data, analogously to (14.8) or,
equivalently, to use ARIMA model structures, which include an integrator in the
noise model. Alternatively, the noise model could be given extra flexibility to find
the integrator or a complex pair of poles on the unit circle to account for periodic
variations. In Goodwin et al. (1986) a comprehensive discussion of the latter problem
is given. With some knowledge of the frequencies of these slow variations, a better
alternative may be to high-pass filter the data. This has the same effect of removing
offsets and slow drifts, but does not push the model fit into the high-frequency range
as differencing does. See Section 14.4 for further comments on this.
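
As an illustration of fitting a straight line to the data and working with the
deviations from it, here is a minimal sketch. The test signal, its trend
coefficients, and the function name `remove_linear_trend` are assumptions made for
the example only.

```python
import numpy as np

def remove_linear_trend(x):
    """Fit a straight line to x by least squares and return the deviations from it."""
    t = np.arange(len(x), dtype=float)
    slope, intercept = np.polyfit(t, x, deg=1)   # highest power first
    return x - (slope * t + intercept), slope, intercept

# Hypothetical record: offset + slow drift + dynamics of interest + noise
rng = np.random.default_rng(0)
t = np.arange(200)
raw = 5.0 + 0.03 * t + np.sin(0.3 * t) + 0.1 * rng.standard_normal(200)

detrended, slope, intercept = remove_linear_trend(raw)
```

Curve segments can be handled in the same way by raising `deg` or by fitting each
segment separately.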


14.2 OUTLIERS AND MISSING DATA 


In practice, the data acquisition equipment is not perfect. It may be that single values
or portions of the input-output data are missing, due to malfunctions in the sensors
or communication links. It may also be that certain measured values are in obvious
error due to measurement failures. Such bad values are often called outliers, and
may have a substantial negative effect on the estimate. Bad values are often much
easier to detect in a residual plot (see Section 16.6).


Example 14.1 Outliers

Consider simulated data from the system

$$
\begin{aligned}
y(t) &- 2.85y(t-1) + 2.717y(t-2) - 0.865y(t-3) \\
&= u(t-1) + u(t-2) + u(t-3) + e(t) + 0.7e(t-1) + 0.2e(t-3)
\end{aligned}
\tag{14.9}
$$

The values $y(313), \ldots, y(320)$ were then artificially changed to zero. The resulting
output plot is shown in Figure 14.1. Although visual inspection shows a possible glitch
around these values, no serious errors seem to be at hand.

Figure 14.1 Data with outliers. Upper plot: the output signal. Lower plot: the
residuals from model $\hat\theta_1$.

An ARMAX model with the correct orders is estimated, giving the estimate
$\hat\theta_1$. The residuals (prediction errors) from this model are also shown in
Figure 14.1. Now, there is no doubt about the data problems around $t = 318$. Another
model was then estimated based only on the first 300 data points. This is denoted by
$\hat\theta_2$. Finally, the whole data record was used to estimate an ARMAX model
applying a robust norm in the criterion according to (15.9)-(15.10). That gave the
estimate $\hat\theta_3$. The estimates of the A- and B-polynomials are summarized as
follows, where $\theta_0$ denotes the true values:


                     a1        a2        a3        b1        b2        b3
$\theta_0$        -2.8500    2.7170   -0.8650    1.0000    1.0000    1.0000
$\hat\theta_1$    -2.8523    2.7200   -0.8668   -0.1669    2.6418    0.4159
$\hat\theta_2$    -2.8504    2.7165   -0.8652    0.9726    1.0496    1.0221
$\hat\theta_3$    -2.8557    2.7267   -0.8701    1.0250    1.0185    0.8842
                                                                      (14.10)


We see that the few outliers have made the estimate of the B-polynomial in
$\hat\theta_1$ quite bad. It should be added that the estimates “are aware” of this:
The estimated standard deviations of the 3 B-parameters in $\hat\theta_1$ are given
as 0.9997, 1.7157, and 1.1421, while those of $\hat\theta_2$ have standard deviations
0.0602, 0.0750, and 0.0611, respectively.
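
The observation that bad values stand out much more clearly in the residuals than in
the raw output can be tried out on a small scale. The sketch below does not reproduce
the example above: it assumes a first-order ARX system with hypothetical
coefficients, fits it by least squares, and flags residuals that exceed five robust
standard deviations. The threshold and the MAD-based scale estimate are our own
illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 600
u = rng.choice([-1.0, 1.0], size=N)          # binary input (assumed)
e = 0.1 * rng.standard_normal(N)
y = np.zeros(N)
for t in range(1, N):
    y[t] = 0.8 * y[t - 1] + u[t - 1] + e[t]  # hypothetical first-order system

y_bad = y.copy()
y_bad[313:321] = 0.0                          # corrupt y(313), ..., y(320)

# Least-squares ARX fit on the corrupted record: y(t) = a*y(t-1) + b*u(t-1)
Phi = np.column_stack([y_bad[:-1], u[:-1]])
theta, *_ = np.linalg.lstsq(Phi, y_bad[1:], rcond=None)
resid = y_bad[1:] - Phi @ theta               # resid[k] belongs to time t = k + 1

# Robust scale via the median absolute deviation, then a 5-sigma rule
sigma = 1.4826 * np.median(np.abs(resid - np.median(resid)))
flagged = np.flatnonzero(np.abs(resid) > 5.0 * sigma) + 1   # time indices
```

In this sketch the flagged time indices cluster around the corrupted stretch, which
is the behavior the residual plot in Figure 14.1 illustrates.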


To deal with outliers and missing data, there are a few possibilities. One is to
cut out segments of the data sequence so that portions with bad data are avoided.
The segments can then be merged using the techniques of Section 14.3. For a data
set with many inputs and outputs it might be difficult, in certain applications, to
find data segments that are “clean” in all variables. It is then better to treat outliers,
both in inputs and outputs, as missing data and view them as unknown parameters.

Dealing With Missing Data

Assume, for the moment, that we have a model $\mathcal{M}(\theta)$ that describes the
relationship between the input-output data. In the basic linear predictor form (4.6)
it is given by

$$
\hat y(t|\theta) = \sum_{k=1}^{t} g(t-k,\theta)u(k) + \sum_{k=1}^{t} h(t-k,\theta)y(k) \tag{14.11}
$$

Suppose now that some of the input-output data are missing. One must distinguish
between missing inputs and missing outputs, since they should be handled differently.
Let us first consider missing inputs.


Missing Input Data. If the input is a deterministic sequence, it is natural to consider
missing inputs as unknown parameters. Since the above expression is linear in the
data, it is clear that for a given model $\mathcal{M}(\theta)$ the missing data can be
estimated using a linear regression, least squares procedure. If we denote the missing
data with the vector $\eta$ we have

$$
\hat y(t|\theta,\eta) = \sum_{k \in K_u} g(t-k,\theta)u(k) + \tilde g(t,\theta)\eta
+ \sum_{k=1}^{t} h(t-k,\theta)y(k) \tag{14.12}
$$

where $K_u$ is the set of indices $k$ of the non-missing inputs $u(k)$, and
$\tilde g(t,\theta)$ is made up from $g(t-k_i,\theta)$, $k_i \notin K_u$, in an obvious
way. The parameters $\theta$ and $\eta$ can then be estimated by a prediction error
criterion in the usual way. Note that for fixed $\theta$, (14.12) is a linear regression
for $\eta$, so missing input data can easily be estimated for any given model. It may
then be natural (but not necessarily numerically efficient) to iterate between
estimating the missing data using the current model, i.e., estimating $\eta$ for fixed
$\theta$, and estimating the model $\theta$ using the currently reconstructed missing
data. To start up the iterations, the first model can be built using linearly interpolated
values for the missing data.
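
The linear regression for $\eta$ at fixed $\theta$ can be sketched as follows. For
simplicity the sketch assumes an output-error type predictor with $h = 0$, so that
the prediction is just the impulse response convolved with the input. The impulse
response `g`, the data, and the function name are hypothetical.

```python
import numpy as np

def estimate_missing_inputs(y, u_known, missing, g):
    """Least-squares estimate of the missing inputs eta for a fixed model.

    Assumes the predictor y_hat(t) = sum_k g[k] u(t-k) (that is, h = 0), with
    u_known holding zeros at the missing positions listed in `missing`.
    """
    N = len(y)
    y_known = np.convolve(g, u_known)[:N]        # contribution of the known inputs
    G = np.zeros((N, len(missing)))              # regressors multiplying eta
    for j, k in enumerate(missing):
        n = min(len(g), N - k)
        G[k:k + n, j] = g[:n]                    # impulse response shifted to time k
    eta, *_ = np.linalg.lstsq(G, y - y_known, rcond=None)
    return eta

# Noise-free illustration with a hypothetical impulse response
rng = np.random.default_rng(2)
N = 50
g = np.array([0.0, 1.0, 0.5, 0.25])
u = rng.standard_normal(N)
y = np.convolve(g, u)[:N]

missing = [10, 25]
u_known = u.copy()
u_known[missing] = 0.0
eta_hat = estimate_missing_inputs(y, u_known, missing, g)
```

In the noise-free case the missing inputs are recovered exactly. Alternating this
step with a re-estimation of $\theta$ gives the iteration described above.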


Missing Output Data. It is not natural to regard missing output data as unknown
parameters, since they are treated as random variables in the prediction framework.
The correct prediction error criterion will be to minimize the error between $y(t)$
and $\hat y(t|\theta, Y_{K_y})$, where the prediction is based on those past $y(k)$ that
actually have been observed ($k \in K_y$). To compute this prediction correctly we can
use the time-varying Kalman filter (4.94)-(4.95) and deal with the missing data as
irregular sampling. To be more specific, suppose that the underlying discrete time
model is given in innovations form as (4.91):

$$
x(t+1,\theta) = A(\theta)x(t,\theta) + B(\theta)u(t) + K(\theta)e(t) \tag{14.13}
$$

$$
y(t) = C(\theta)x(t,\theta) + e(t) \tag{14.14}
$$

The cross-covariance between process noise and measurement noise (see (4.85)) will
be $R_{12}(\theta) = K(\theta)R_2$. Now, if some or all components of $y(t)$ are missing
at a certain time $t$, this is treated as time-varying $C_t(\theta)$ and
$R_{t,12}(\theta)$, where only those rows of $C$ and $R_{12}$ are extracted that
correspond to measured outputs. If all outputs are missing at time $t$, $C_t$ and
$R_{t,12}$ will be the empty matrices. The time-varying Kalman filter (4.94)-(4.95)
with $C_t(\theta)$ and $R_{t,12}(\theta)$ inserted into (4.95) will now produce the
correct predictors

$$
\hat y(t|\theta) = C(\theta)\hat x(t,\theta) = \hat y(t|\theta, Y_{K_y})
$$

Working with the time-varying predictor of course leads to many more computations,
and approximate alternatives may sometimes be preferable. One approximation
would be to replace any missing $y(k)$ in (14.11) by $\hat y(k|\theta)$, i.e., the
predictor based on data up to time $k - 1$. (The correct replacement would be to use
the smoothed estimate of $y(k)$ using measured data up to time $t - 1$.) Another
approximation would be to treat also missing outputs as unknown parameters. This
corresponds to replacing missing $y(k)$ in (14.11) by their smoothed estimates, using
the whole data record. A third possibility is to carry out the minimization of the
prediction error criterion by the EM-method, see Problem 10G.3. The missing data then
correspond to the auxiliary measurements $X$. See Isaksson (1993).
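
To indicate what treating missing outputs as irregular sampling amounts to, here is
a minimal scalar sketch of a time-varying Kalman predictor in which a missing
output (marked by NaN) simply skips the measurement update. It is written for a
process/measurement noise description with uncorrelated noises, which is a
simplifying assumption (the innovations form above has $R_{12} = K R_2$). All
numerical values and names are hypothetical.

```python
import numpy as np

def kalman_predictor_missing(y, u, a, b, c, r1, r2, p0=1.0):
    """One-step-ahead predictor for the scalar model
        x(t+1) = a x(t) + b u(t) + w(t),   y(t) = c x(t) + v(t),
    with Var w = r1, Var v = r2 and w, v uncorrelated (assumption).
    Missing outputs are marked by NaN and skip the measurement update."""
    N = len(y)
    y_hat = np.zeros(N)
    P_seq = np.zeros(N)
    x, P = 0.0, p0
    for t in range(N):
        y_hat[t] = c * x          # prediction of y(t) from data before t
        P_seq[t] = P              # state prediction error variance at time t
        if np.isnan(y[t]):
            # C_t is "empty": time update only
            x = a * x + b * u[t]
            P = a * P * a + r1
        else:
            S = c * P * c + r2                    # innovation variance
            K = a * P * c / S                     # predictor gain
            x = a * x + b * u[t] + K * (y[t] - c * x)
            P = a * P * a + r1 - K * S * K
    return y_hat, P_seq

# Simulated record with a stretch of missing outputs
rng = np.random.default_rng(4)
N = 100
a, b, c, r1, r2 = 0.9, 1.0, 1.0, 1.0, 1.0
u = rng.standard_normal(N)
x = 0.0
y = np.zeros(N)
for t in range(N):
    y[t] = c * x + np.sqrt(r2) * rng.standard_normal()
    x = a * x + b * u[t] + np.sqrt(r1) * rng.standard_normal()
y[50:55] = np.nan                                  # outputs 50..54 are missing

y_hat, P_seq = kalman_predictor_missing(y, u, a, b, c, r1, r2)
```

The prediction error variance grows during the gap and settles again once
measurements return. In a prediction error criterion, only the time points with
observed outputs would contribute.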


14.3 SELECTING SEGMENTS OF DATA AND MERGING EXPERIMENTS 


Selecting Data Segments 


When data from an identification experiment or, in particular, from normal operating
records are plotted, it often happens that there are portions of bad data or
non-relevant information. The reason could be that there are long stretches of missing
data which will be difficult or computationally costly to reconstruct. There could be
portions with disturbances that are considered to be non-representative, or that take
the process into operating points that are of less interest. In particular for normal
operating records, there could also be long periods of “no information”: nothing
seems to happen that carries any information about the process dynamics. In these
cases it is natural to select segments of the original data set which are considered to
contain relevant information about dynamics of interest. The procedure of how to
select such segments will basically be subjective and will have to rely mostly upon
intuition and process insights.


Merging Data Sets 


It is also a very common situation in practice that a number of separate experiments
have been performed. The reason could be that the plant is not available for long,
continuous experiments, or that only one input at a time is allowed to be manipulated
in separate experiments. A further reason is, as described above, that bad data have
forced us to split up the data record into several separate segments. How shall
such separate records be treated? We cannot simply concatenate the data segments,
because the connection points would cause transients that may destroy the estimate.

Suppose we build a model for each of the data segments, all with the same
structure. Let the parameter estimate for segment $i$ be denoted by
$\hat\theta^{(i)}$, and let its estimated covariance matrix be $P^{(i)}$. Assume also
that the segments are so well separated that the different estimates can be regarded
as independent. It is then well known from basic statistics that the optimal way to
combine these estimates (giving a resulting estimate of smallest variance) is to weigh
them according to their inverse covariance matrices:

$$
\hat\theta = P \sum_{i=1}^{n} \left[P^{(i)}\right]^{-1} \hat\theta^{(i)}, \qquad
P = \left[\sum_{i=1}^{n} \left[P^{(i)}\right]^{-1}\right]^{-1} \tag{14.15}
$$


$P$ will then also be the covariance matrix of the resulting estimate $\hat\theta$.
This can be proved in different ways. See Problem 14E.2 for a matrix-based proof.
Compare also Lemma II.2 in Appendix II and Problem 6E.3.
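
A direct transcription of (14.15) is short. The sketch below is only an illustration:
the function name `merge_estimates` and the two example estimates are hypothetical.

```python
import numpy as np

def merge_estimates(thetas, covs):
    """Combine independent estimates by inverse-covariance weighting, as in (14.15)."""
    infos = [np.linalg.inv(P) for P in covs]               # information matrices
    P = np.linalg.inv(sum(infos))                          # covariance of the merged estimate
    theta = P @ sum(Pi_inv @ th for Pi_inv, th in zip(infos, thetas))
    return theta, P

# Two hypothetical segment estimates of a two-parameter model
theta_1, P_1 = np.array([1.0, 2.0]), np.diag([0.1, 0.4])
theta_2, P_2 = np.array([1.2, 1.8]), np.diag([0.3, 0.1])
theta_merged, P_merged = merge_estimates([theta_1, theta_2], [P_1, P_2])
```

Each parameter is pulled toward the segment that estimates it with the smaller
variance, and the merged variance is smaller than either individual one.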

One might ask if this estimate could not be obtained directly from the data
segments. To see this, let us seek guidance from the linear regression case. Assume
we are treating the model

$$
y(t) = \varphi^T(t)\theta \tag{14.16}
$$

The estimate for any segment will be

$$
\hat\theta^{(i)} = \left[\sum_{t \in T^i} \varphi(t)\varphi^T(t)\right]^{-1}
\sum_{t \in T^i} \varphi(t)y(t), \qquad
P^{(i)} = \hat\lambda^{(i)} \left[\sum_{t \in T^i} \varphi(t)\varphi^T(t)\right]^{-1}
$$

$$
\hat\lambda^{(i)} = \frac{1}{|T^i|} \sum_{t \in T^i}
\left(y(t) - \varphi^T(t)\hat\theta^{(i)}\right)^2
$$

Here $T^i$ is the index set of the $i$th segment, excluding those $t$ for which
$\varphi(t)$ is not fully known. This means that the first $\max(n_a, n_b)$ samples are
excluded from each segment for an ARX-model (4.7). It was called the covariance method
or the non-windowed case in the discussion following (10.13). By $|T^i|$ we mean the
number of time indices in $T^i$.

If we apply (14.15) to these estimates, and assume that $\hat\lambda^{(i)}$ is
independent of $i$, it is easy to see that the resulting estimate is the same as the
least-squares estimate for (14.16) would be if the summation was carried out over the
union $\cup_i T^i$ of segments. This is of course most natural. Note, however, that for
a dynamic model, this is not the same as first concatenating the data segments and
then applying an ARX-model. Cutting away the first observations in each segment
eliminates the transient problems at the merging points.
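
The equivalence between merging by (14.15) and summing over the union of segments
can be checked numerically. The sketch below uses a static linear regression with
randomly generated regressors, so no samples need to be cut away at the segment
starts. The data and the common value `lam` of the noise variance are assumptions
for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
theta0 = np.array([0.5, -1.0])            # hypothetical true parameters
segments = []
for n in (40, 60):
    Phi = rng.standard_normal((n, 2))     # rows are phi^T(t)
    y = Phi @ theta0 + 0.1 * rng.standard_normal(n)
    segments.append((Phi, y))

lam = 1.0   # common noise variance estimate, assumed independent of the segment
R = [Phi.T @ Phi for Phi, _ in segments]
f = [Phi.T @ y for Phi, y in segments]
theta_i = [np.linalg.solve(Ri, fi) for Ri, fi in zip(R, f)]
P_i = [lam * np.linalg.inv(Ri) for Ri in R]

# Merge the segment estimates according to (14.15)
P = np.linalg.inv(sum(np.linalg.inv(Pi) for Pi in P_i))
theta_merged = P @ sum(np.linalg.inv(Pi) @ th for Pi, th in zip(P_i, theta_i))

# Least squares with the sums taken over the union of the segments
theta_pooled = np.linalg.solve(sum(R), sum(f))
```

The two estimates agree to numerical precision, since the common factor `lam`
cancels in (14.15).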

For the general case, with predictor models that have infinite impulse responses,
this suggests that a criterion should be formed as

$$
V(\theta) = \sum_{t \in T^1} \bigl(y(t) - \hat y(t|\theta)\bigr)^2 + \ldots
+ \sum_{t \in T^n} \bigl(y(t) - \hat y(t|\theta)\bigr)^2 \tag{14.17}
$$

where the filters that compute $\hat y(t|\theta)$ for each of the segments should be
reinitialized with zero initial conditions (or associated with a separate set of initial
conditions to be estimated). The actual minimization algorithm is however entirely
analogous to the one described in Sections 10.2 and 10.3.

What are the advantages of (14.17) compared to (14.15)? In the first place, it is
more efficient to use just one minimization run rather than separate ones. Second,
if each of the experiments (segments) is poorly exciting but they provide good
excitation taken together, then minimizing (14.17) will be better conditioned than
minimizing each of the sub-sums. Such a situation arises, e.g., when safety or
production reasons require separate experiments, where one input at a time is
manipulated.
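
The role of reinitializing the predictor filters in (14.17) can be seen in a small
experiment. The sketch assumes a first-order output-error model, noise-free data,
and segments that each start from rest. The model, its parameters, and the function
names are hypothetical.

```python
import numpy as np

def simulate_oe(theta, u):
    """Output-error predictor y_hat(t) = -f*y_hat(t-1) + b*u(t-1), zero initial state."""
    b, f = theta
    y_hat = np.zeros(len(u))
    for t in range(1, len(u)):
        y_hat[t] = -f * y_hat[t - 1] + b * u[t - 1]
    return y_hat

def multi_segment_cost(theta, segments):
    """Criterion of the form (14.17): the predictor is restarted for every segment."""
    return sum(np.sum((y - simulate_oe(theta, u)) ** 2) for u, y in segments)

rng = np.random.default_rng(3)
theta_true = (1.0, -0.9)
segments = []
for n in (80, 120):
    u = rng.standard_normal(n)
    segments.append((u, simulate_oe(theta_true, u)))   # each experiment starts at rest

cost_segmented = multi_segment_cost(theta_true, segments)

# Naive alternative: glue the records together and run one predictor through them
u_cat = np.concatenate([u for u, _ in segments])
y_cat = np.concatenate([y for _, y in segments])
cost_concatenated = multi_segment_cost(theta_true, [(u_cat, y_cat)])
```

At the true parameters the segmented criterion is zero, while the concatenated
record shows a transient error after the connection point. A minimization based on
the concatenated record would distort the parameters to reduce that transient.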


Averaging over Periodic Data 


A different way of “merging” data sets is at hand when an experiment with a periodic 
input has been conducted. We noted in Section 13.3 that it is then advantageous to 
average the output signal over the periods, so that the condensed set consists of just 
one period of input-output data. See (13.34). This allows shorter data records and 
independent noise estimates. 
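
A minimal sketch of the averaging step follows. It assumes that the record contains
an integer number of periods and that transients have died out. The signal, noise
level, and function name are hypothetical.

```python
import numpy as np

def average_over_periods(x, period):
    """Average a signal over its periods and return one condensed period."""
    n_periods = len(x) // period
    return x[: n_periods * period].reshape(n_periods, period).mean(axis=0)

rng = np.random.default_rng(6)
period, n_periods = 20, 10
y0 = np.sin(2 * np.pi * np.arange(period) / period)   # noise-free periodic response
y = np.tile(y0, n_periods) + 0.5 * rng.standard_normal(period * n_periods)

y_avg = average_over_periods(y, period)

# The scatter between periods gives a noise variance estimate that does not
# depend on any model of the system
noise_var = np.var(y.reshape(n_periods, period), axis=0, ddof=1).mean()
```

The averaged period has a noise variance reduced by the number of periods, and
`noise_var` comes out close to the variance 0.25 used in the simulation.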


14.4 PREFILTERING 


Prefiltering the input and the output data through the same filter will not change the
input-output relation for a linear system:

$$
y(t) = G_0(q)u(t) + H_0(q)e(t) \;\Rightarrow\;
L(q)y(t) = G_0(q)L(q)u(t) + L(q)H_0(q)e(t)
$$

(In the multivariable case all signals must be subjected to the same filter, so that
$L(q)$ is a multiple of the identity matrix.) The filtering however changes the noise
characteristics, so the estimated model will still be affected by the prefiltering. In
this section we shall discuss the role and use of this feature.
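
That the input-output relation survives a common prefilter can be verified in the
noise-free case. The sketch below assumes a first-order ARX system and an FIR
prefilter, both with hypothetical coefficients, and compares least-squares
estimates from raw and filtered data.

```python
import numpy as np

def prefilter(x, l_coeffs):
    """Apply the FIR prefilter L(q) = l0 + l1*q^{-1} + ... with zero initial conditions."""
    return np.convolve(l_coeffs, x)[: len(x)]

def arx_ls(y, u):
    """Least-squares fit of y(t) + a*y(t-1) = b*u(t-1); returns [a, b]."""
    Phi = np.column_stack([-y[:-1], u[:-1]])
    theta, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
    return theta

rng = np.random.default_rng(5)
N = 200
u = rng.standard_normal(N)
y = np.zeros(N)
for t in range(1, N):
    y[t] = 0.7 * y[t - 1] + 0.5 * u[t - 1]     # a = -0.7, b = 0.5, no noise

L = np.array([1.0, -0.5])                       # hypothetical prefilter 1 - 0.5*q^{-1}
theta_raw = arx_ls(y, u)
theta_filtered = arx_ls(prefilter(y, L), prefilter(u, L))
```

Without noise both estimates coincide with the true parameters. With noise present
they would differ, because the filter reshapes the noise spectrum.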


From an estimation point of view, filtering the prediction errors before making
the fit, as in (7.10), is an important option:

$$
\begin{aligned}
L(q)\varepsilon(t,\theta) &= \frac{L(q)}{H(q,\theta)}\bigl(y(t) - G(q,\theta)u(t)\bigr) \\
&= H^{-1}(q,\theta)\bigl(L(q)y(t) - G(q,\theta)L(q)u(t)\bigr)
\end{aligned}
\tag{14.18}
$$

From these expressions we see a few things:

• Filtering prediction errors is the same as filtering the observed input-output
  data. In the multivariable case the same filter must then be applied to all
  signals.

• A prefilter $L(q)$ is equivalent to a noise model $H(q) = 1/L(q)$. We can thus
  interchangeably talk about prefilters and noise models.

The noise model/prefilter has three functions:

1. We know from Section 8.5 that it will affect the bias distribution of the resulting
   model. We shall shortly review these results.

2. The discussion in Section 9.3 shows that if the transfer function estimate $G$ is
   unbiased, then the best accuracy is obtained for a prefilter that corresponds
   to the true noise characteristics: $L(q) = 1/H_0(q)$. (See discussion following
   (9.42).)

3. The function of the prefilter may also be to remove disturbances of high or low
   frequencies that we do not want to include in the modeling.

The second function is the classical statistical one: In order to attain the Cramér-Rao
bound, we need a correct noise model. Since this is typically unknown, it is natural to
estimate that too, by including parameters in the noise model/prefilter. For purposes
1 and 3 there is no reason to let the prefilter contain adjustable parameters: On the
contrary, a parameterized noise model will pull $H(q,\theta)/L(q)$ to resemble the error
spectrum (see (8.73)), and this may undo what $L(q)$ was intended to achieve for
these purposes. The three functions of the prefilter may consequently be conflicting.

While the second purpose of the prefilter really is a noise modeling issue, the 
two others correspond to pure data preprocessing. We shall now review the use of 
prefiltering for each of these two purposes. 


Affecting the Bias Distribution 


In general, it is not possible to describe the true system exactly within the chosen
model set, so that the model will be biased. Prefiltering the data may have a
substantial influence on the distribution of this bias. As we found in Section 8.5, the
limiting model can be interpreted as a compromise between minimizing

$$
\begin{aligned}
\theta^*(\mathcal{D}) &= \arg\min_{\theta \in D_{\mathcal{M}}} \int_{-\pi}^{\pi}
\left|G_0(e^{i\omega}) - G(e^{i\omega},\theta)\right|^2 Q(\omega,\theta^*)\,d\omega \\
Q(\omega,\theta) &= \frac{|L(e^{i\omega})|^2\,\Phi_u(\omega)}{|H(e^{i\omega},\theta)|^2}
\end{aligned}
\tag{14.19}
$$

on the one hand and fitting $|H(e^{i\omega},\theta)/L(e^{i\omega})|^2$ to the error
spectrum $\Phi_{ER}(\omega,\theta^*)$ on the other. See (8.73). This means that
$Q(\omega,\theta^*)$ will be taken as the weighting function that determines the bias
distribution of $G$. This weighting function can in turn be affected by properly
selecting the

• Input spectrum $\Phi_u(\omega)$

• Noise model set $H(q,\theta)$                                          (14.20)

• Prefilter $L(q)$




Notice that it is only the ratio $\Phi_u|L|^2/|H|^2$ that determines the bias
distribution: the values of the individual functions $\Phi_u$, $H$, and $L$ are
immaterial. Note also that the interpretation of the role of the prefilter is clear cut
only if the noise model $H$ does not depend on $\theta$. In the general case (14.19) is
somewhat heuristic, but it still is a quite useful tool to understand and manipulate
the bias distribution. We illustrate that in the following example.


Example 14.2 Affecting the Bias Distribution

Consider the system (8.78) of Example 8.5. The resulting model in the OE-structure
(8.79) gave the Bode plot of Figure 8.2a. This corresponds to $Q(\omega) \equiv 1$
($\theta$-independent) in (14.19). Since $|G(e^{i\omega})|$ decays very rapidly for high
frequencies (has a rapid roll-off), this means that high frequencies play very little
role in the Bode plot fit.


Figure 14.2 Amplitude Bode plot of the true system and model identified in the
OE-structure (8.79), with an HP prefilter $L_1(q)$ (cf. Figure 8.2). Horizontal axis:
frequency (rad/s). Thick line: true system; thin line: model.


To enhance the high-frequency fit, we filter the prediction errors through a
fifth-order high-pass (HP) Butterworth filter $L_1(q)$ with a cut-off frequency of 0.5
rad/sec. This changes $Q(\omega)$ to this HP-filter. The Bode plot of the resulting
estimate is given in Figure 14.2. The fit has now moved into high frequencies, but
clearly the second-order model has problems describing the fourth-order roll-off.

Consider now the estimate obtained by the least-squares method in the ARX-model
structure (8.81). This was depicted in Figure 8.2b. If we want a better low-frequency
fit, it seems reasonable to counteract the HP weighting function
$Q(\omega,\theta^*)$ in Figure 8.3 by low-pass (LP) filtering of the prediction errors.
We thus construct $L_2(q)$ as a fifth-order LP Butterworth filter with cut-off
frequency 0.5 rad/sec. The ARX model is then estimated for the input-output data
filtered through $L_2(q)$. Equivalently, we could say that the prediction-error method
is used for the model structure

$$
y(t) = \frac{b_1q^{-1} + b_2q^{-2}}{1 + a_1q^{-1} + a_2q^{-2}}\,u(t)
+ \frac{1}{L_2(q)\left(1 + a_1q^{-1} + a_2q^{-2}\right)}\,e(t) \tag{14.21}
$$

Figure 14.3 Bode plots of the true system and model identified in the
ARX-structure (8.81) with an LP prefilter $L_2(q)$ (cf. Figure 8.2b). Horizontal axis:
frequency (rad/s). Legend as in Figure 14.2.


The resulting estimate is shown in Figure 14.3, and the corresponding weighting
function $Q(\omega,\theta^*)$ in Figure 14.4. The resulting models in Figures 8.2a and
14.3 are quite similar. One should then realize that the ARX estimate for filtered data
of Figure 14.3 is much easier to obtain than the output error estimate of Figure 8.2a,
which requires iterative search.


Figure 14.4 The weighting function $Q(\omega,\theta^*) = |L_2(e^{i\omega})|^2 \cdot
|1 + a_1^* e^{-i\omega} + a_2^* e^{-2i\omega}|^2$ corresponding to the estimate in
Figure 14.3. Horizontal axis: frequency (rad/s).
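
The weighting function in (14.19) is easy to evaluate numerically, which helps when
choosing a prefilter. The sketch below compares the weighting implied by an ARX noise
model $H = 1/A$ with that of an output-error model $H = 1$, both with a white input
and no prefilter. The coefficients `a1`, `a2` are hypothetical and are not those of
the example above.

```python
import numpy as np

def bias_weighting(phi_u, L_mag2, H_mag2):
    """Weighting Q = Phi_u * |L|^2 / |H|^2 of (14.19), evaluated on a frequency grid."""
    return phi_u * L_mag2 / H_mag2

omega = np.linspace(0.0, np.pi, 5)
a1, a2 = -1.5, 0.7                                   # hypothetical ARX denominator
A = 1.0 + a1 * np.exp(-1j * omega) + a2 * np.exp(-2j * omega)

# ARX: H = 1/A, so the weighting is |A|^2 (small at low, large at high frequencies)
Q_arx = bias_weighting(1.0, 1.0, 1.0 / np.abs(A) ** 2)
# Output error: H = 1 gives a flat weighting
Q_oe = bias_weighting(1.0, 1.0, np.ones_like(omega))
```

For these coefficients the ARX weighting rises from 0.04 at frequency zero to 10.24
at $\omega = \pi$, which is the high-pass character that a low-pass prefilter is
meant to counteract.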


Dealing with Disturbances 


The third purpose of prefiltering, as listed in the beginning of this section, is to
remove disturbances in the data that we do not want to include in the modeling. This
actually goes hand in hand with the noise modeling aspect of prefiltering: Removing,
say, a seasonal variation of a certain frequency by a band-stop filter can also be
interpreted as fixing a noise model with very high gain in this frequency band, which
is a way of expressing the presence of the seasonal variation.

High-Frequency Disturbances. High-frequency disturbances in the data, above the
frequencies of interest for the system dynamics, indicate that the choices of sampling
interval and presampling filters were not thoughtful enough. This can however be
remedied by low-pass filtering of the data. Also, if it turns out that the sampling
interval was unnecessarily short, one may always resample the data by picking every
$s$th sample from the original record. Then, however, a digital antialias filter must be
applied before the resampling, in the same manner as discussed in Section 13.7.

Low-Frequency Disturbances. Low-frequency disturbances in terms of offset, drift,
and slow seasonal variations were discussed in Section 14.1. A very suitable method
to deal with such problems is to apply high-pass filtering. This must be considered as
a clearly better alternative to data differencing.


14.5 FORMAL DESIGN OF PREFILTERING AND INPUT PROPERTIES 


To secure good models, we have a number of design variables, as discussed in Chapter 
12. In addition to designing the experiment according to the advice of Chapter 13, 
we also have the prefilter as an important design variable. We shall in this section 
return to the formal design problem (12.7)-(12.9) and consider the joint design of 
prefilter and input signal properties. 


Optimizing the Bias Distribution 


Let us first turn to the formal design problem (12.29) for the bias error, in the special
case

$$
C(\omega) = \begin{bmatrix} C_{11}(\omega) & 0 \\ 0 & 0 \end{bmatrix} \tag{14.22}
$$

With (14.22), (12.29) can be rewritten

$$
\mathcal{D}_{opt} = \arg\min_{\mathcal{D}} \int_{-\pi}^{\pi}
\left|G(e^{i\omega},\theta^*(\mathcal{D})) - G_0(e^{i\omega})\right|^2
C_{11}(\omega)\,d\omega \tag{14.23}
$$

Let us also specialize to open-loop operation: $\Phi_{ue}(\omega) \equiv 0$ and the noise
model to be fixed to 1: $H(q,\theta) \equiv 1$. (Since we allow prefiltering, this
includes the case of any given, fixed noise model.) We have the input spectrum and
prefilter

$$
\mathcal{D} = \{\Phi_u(\cdot), L(\cdot)\} \tag{14.24}
$$

as design variables. In this case (14.19) specializes to

$$
\theta^*(\mathcal{D}) = \arg\min_{\theta \in D_{\mathcal{M}}} \int_{-\pi}^{\pi}
\left|G_0(e^{i\omega}) - G(e^{i\omega},\theta)\right|^2
\Phi_u(\omega)\,|L(e^{i\omega})|^2\,d\omega \tag{14.25}
$$

We are thus faced with the minimization problem (14.23) with the function
$\theta^*(\mathcal{D})$ defined by (14.24) and (14.25). This problem has an explicit
solution.

Theorem 14.1. Consider the optimization problem (14.23) with
$\theta^*(\mathcal{D})$ defined by (14.24) and (14.25). The minimum is obtained for any
$\mathcal{D}$ such that

$$
\Phi_u(\omega)\,|L(e^{i\omega})|^2 = \alpha \cdot C_{11}(\omega) \tag{14.26}
$$

provided this is an admissible design for some positive scalar $\alpha$.

Before turning to the proof, we note that the theorem says that the signal-to-noise
model ratio should be chosen proportional to the criterion weighting function,
and this can be accomplished either by input design or by prefilter/noise model
selection. Some extensions to this theorem are outlined in Problems 14G.1 to 14G.3.
Note in particular that for (14.22), open-loop operation indeed is optimal. That is, if
$\Phi_{ue}(\omega)$ is included among the design variables (14.24), then the optimal
solution is $\Phi_{ue}(\omega) \equiv 0$, together with (14.26).


Proof. First we establish the following lemma.

Lemma 14.1. Let $V(x, y)$ be a scalar-valued function of two variables such that
each may take values in some general Hilbert space. For a fixed $y$, let

$$
x^*(y) = \arg\min_{x} V(x, y) \tag{14.27}
$$

and for a fixed $z$, let

$$
y^*(z) = \arg\min_{y} V(x^*(y), z) \tag{14.28}
$$

assuming that these minimizing values are unique and well defined. Then

$$
y^*(z) = z \tag{14.29}
$$

Proof of Lemma 14.1: By the definition (14.27),

$$
V(x^*(z), z) \le V(x, z), \quad \forall x,\ \forall z
$$

Hence

$$
V(x^*(z), z) \le V(x^*(y), z), \quad \forall y \tag{14.30}
$$

From (14.28), by definition,

$$
V(x^*(y^*(z)), z) \le V(x^*(y), z), \quad \forall y \tag{14.31}
$$

Now, (14.30) and (14.31) imply

$$
x^*(y^*(z)) = x^*(z)
$$

which implies (14.29) since the mapping $x^*$ is assumed to be injective (follows from
the assumed uniqueness of the minimum in (14.28)). End proof of Lemma 14.1. □

Remark. The assumption of uniqueness in (14.28) can be relaxed by considering
instead the set

$$
Y^*(z) = \left\{ y \;\middle|\; V(x^*(y), z) = \min_{y'} V(x^*(y'), z) \right\}
$$

The statement corresponding to (14.29) then is $z \in Y^*(z)$, which is sufficient for
our purposes.

To return to the problem (14.23), introduce the function

$$
W(\theta, \kappa(\cdot)) = \int_{-\pi}^{\pi}
\left|G(e^{i\omega},\theta) - G_0(e^{i\omega})\right|^2 \kappa(\omega)\,d\omega \tag{14.32}
$$

and define

$$
\tilde\theta(\kappa(\cdot)) = \arg\min_{\theta} W(\theta, \kappa(\cdot)) \tag{14.33}
$$

We thus have, from (14.25),

$$
\theta^*(\mathcal{D}) = \tilde\theta\left(\Phi_u(\cdot)\,|L(\cdot)|^2\right) \tag{14.34}
$$

Since the limiting estimate $\theta^*$ depends on the functions in $\mathcal{D}$ only
via the product

$$
\kappa(\omega, \mathcal{D}) = \Phi_u(\omega)\,|L(e^{i\omega})|^2
$$

the optimal design problem (14.23) is in fact a search for the best $\kappa$:

$$
\kappa_{opt} = \arg\min_{\kappa} W(\theta^*(\mathcal{D}), C_{11})
= \arg\min_{\kappa} W\left(\tilde\theta(\kappa(\cdot)), C_{11}(\cdot)\right) \tag{14.35}
$$

Applying Lemma 14.1 to (14.35), we find that

$$
\kappa_{opt}(\omega) = \alpha \cdot C_{11}(\omega) \tag{14.36}
$$

where $\alpha$ is any positive scalar such that the design is admissible
($\mathcal{D} \in \Delta$). This follows since scaling $\kappa$ does not affect the
problem (14.35). This concludes the proof of Theorem 14.1. □
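
For readers who want the application of Lemma 14.1 spelled out, the correspondence
between the lemma and the design problem can be written as follows. This only
restates the steps of the proof. The symbol $\tilde\theta(\kappa)$ stands for the
minimizer of $W(\theta,\kappa)$ over $\theta$, and the remark on scaling uses only
that $W$ is linear in $\kappa$.

```latex
\begin{aligned}
x &\leftrightarrow \theta, &
y &\leftrightarrow \kappa(\cdot), &
z &\leftrightarrow C_{11}(\cdot), &
V(x,y) &\leftrightarrow W(\theta,\kappa(\cdot)) \\
x^*(y) &\leftrightarrow \tilde\theta(\kappa(\cdot)), &
y^*(z) &\leftrightarrow \kappa_{opt} & & & & \\
\end{aligned}
\qquad
y^*(z) = z \;\Longrightarrow\; \kappa_{opt}(\cdot) = C_{11}(\cdot)

% Scaling: W(\theta, \alpha\kappa) = \alpha\, W(\theta, \kappa) for \alpha > 0,
% hence \tilde\theta(\alpha\kappa) = \tilde\theta(\kappa), and every
% \kappa = \alpha\, C_{11} gives the same limiting model.
```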

Optimizing the Mean Square Error


Theorem 14.1 solves the problem of minimizing the bias contribution to the criterion 
(14.23). In (13.83) we minimized the variance contribution to the same criterion. It 
is easy to see that both criteria can be minimized simultaneously, which means that 
we can solve the total error problem (12.30). This gives the following result. 


Theorem 14.2. Consider the problem of minimizing the mean square error

$$
\arg\min_{\mathcal{D}} \int_{-\pi}^{\pi}
E\left|G(e^{i\omega},\hat\theta_N(\mathcal{D})) - G_0(e^{i\omega})\right|^2
C_{11}(\omega)\,d\omega \tag{14.37}
$$

with respect to the design variables prefilter, input spectrum, and cross spectrum
between input and noise (i.e., feedback mechanism)

$$
\mathcal{D} = \{L(q), \Phi_u(\omega), \Phi_{ue}(\omega)\} \tag{14.38}
$$

under the constraints

$$
\int_{-\pi}^{\pi} \Phi_u(\omega)\,d\omega \le \alpha, \qquad H(q,\theta) \equiv 1 \tag{14.39}
$$

The solution is

$$
\begin{aligned}
\Phi_u^{opt}(\omega) &= \mu_1 \sqrt{C_{11}(\omega)\,\Phi_v(\omega)} \\
\Phi_{ue}^{opt}(\omega) &\equiv 0 \\
\left|L^{opt}(e^{i\omega})\right|^2 &= \mu_2 \sqrt{\frac{C_{11}(\omega)}{\Phi_v(\omega)}}
\end{aligned}
\tag{14.40}
$$

Here $\mu_1$ is adjusted so that the input power constraint is met, while $\mu_2$ is a
constant such that the filter $L(q)$ is monic.

This result could be obtained easily since the two components of the mean
square error in (12.26) could be minimized simultaneously with respect to
$\mathcal{D}$, and no compromise had to be made. For other design variables, typically
the model order, bias and variance are conflicting and an optimal trade-off has to be
met.
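
The design (14.40) can be evaluated on a frequency grid once $C_{11}$ and the
disturbance spectrum $\Phi_v$ are given. In the sketch below both spectra are
hypothetical, the integral in (14.39) is approximated by a Riemann sum, and $\mu_2$ is
simply set to one, so the monic normalization of $L(q)$ is not enforced.

```python
import numpy as np

def optimal_design(c11, phi_v, d_omega, power):
    """Evaluate (14.40) on a grid: returns Phi_u and |L|^2 (with mu_2 = 1, an
    arbitrary normalization chosen for this sketch)."""
    shape_u = np.sqrt(c11 * phi_v)
    mu1 = power / (np.sum(shape_u) * d_omega)    # meet the input power constraint
    phi_u = mu1 * shape_u
    L_mag2 = np.sqrt(c11 / phi_v)
    return phi_u, L_mag2

M = 2000
d_omega = 2.0 * np.pi / M
omega = -np.pi + d_omega * (np.arange(M) + 0.5)                 # midpoint grid
c11 = 1.0 / (1.0 + 4.0 * omega ** 2)                            # hypothetical weighting
phi_v = 0.1 / np.abs(1.0 - 0.8 * np.exp(-1j * omega)) ** 2      # hypothetical noise spectrum

phi_u, L_mag2 = optimal_design(c11, phi_v, d_omega, power=1.0)
ratio = phi_u * L_mag2 / c11      # should be constant, cf. (14.26)
```

The product $\Phi_u|L|^2$ comes out proportional to $C_{11}$, so the design also
satisfies the bias condition of Theorem 14.1.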

Example 14.3 Pole Placement

Suppose we intend to use the model to design a regulator that achieves a certain
closed loop system $R(q)$. Suppose the true system is $G_0$ and we use the model
$\hat G$. A regulator that gives the desired closed loop with the model is

$$
u(t) = F_r(q)r(t) - F_y(q)y(t)
$$

where $F_r$, $F_y$ are subject to

$$
\hat G(q)F_r(q) = R(q)\left(1 + \hat G(q)F_y(q)\right)
$$

Here $r$ is the reference signal, and we want to achieve $y = Rr$. The difference
between the desired and actual closed loop is

$$
\begin{aligned}
\tilde G_c &= \frac{G_0F_r}{1 + G_0F_y} - R
= \frac{G_0F_r}{1 + G_0F_y} - \frac{\hat GF_r}{1 + \hat GF_y} \\
&= \frac{(G_0 - \hat G)F_r}{(1 + G_0F_y)(1 + \hat GF_y)}
= \frac{(G_0 - \hat G)R}{\hat G(1 + G_0F_y)}
\approx \frac{(G_0 - \hat G)R}{G_0(1 + G_0F_y)}
\end{aligned}
$$

where the last step holds if $\hat G$ is close to $G_0$. The size (variance) of the
error $\tilde y = \tilde G_c r$ is thus

$$
\int_{-\pi}^{\pi} |G_0 - \hat G|^2 \cdot C_{11}\,d\omega, \qquad
C_{11} = \left|\frac{R}{G_0(1 + G_0F_y)}\right|^2 \Phi_r
$$

Consequently the experiment that gives the best model for this regulator design
under the constraints (14.39) is

$$
\Phi_u^{opt}(\omega) = \mu_1
\frac{|R(e^{i\omega})|\,\sqrt{\Phi_r(\omega)}\,|H_0(e^{i\omega})|}
{|G_0(e^{i\omega})|\,|1 + G_0(e^{i\omega})F_y(e^{i\omega})|}
$$

$$
\left|L^{opt}(e^{i\omega})\right|^2 = \mu_2
\frac{|R(e^{i\omega})|\,\sqrt{\Phi_r(\omega)}}
{|G_0(e^{i\omega})|\,|H_0(e^{i\omega})|\,|1 + G_0(e^{i\omega})F_y(e^{i\omega})|}
$$

We see that the characteristics of the true system are required in order to compute
the optimal design. Even though these may not be known in detail, the expressions
are still useful. They tell us to spend the input power where

1. A gain increase is desired: $|R(e^{i\omega})| / |G_0(e^{i\omega})| \gg 1$

2. The reference signal is going to have energy: $\Phi_r(\omega)$ large

3. The disturbances are significant: $|H_0(e^{i\omega})|$ large

4. The sensitivity due to feedback is poor: $|1 + G_0(e^{i\omega})F_y(e^{i\omega})|$ small

These suggestions are as such quite natural, but their formalization is useful.


Identification for Control 


Control design is one of the most important uses of identified models. Considerable
effort has therefore been spent on designing experiments and methods that give
models well suited for control design. Feedback control is both forgiving and demanding
in the sense that we can have good control even with a mediocre model, as long as
it is reliable in certain frequency ranges. Loosely speaking, the model has to be
reliable around the cross-over frequency (≈ the bandwidth of the closed loop system),
and it may be bad where the closed loop sensitivity function is small. The required
accuracy of the model therefore depends on the (unknown and to-be-designed)
sensitivity function. We saw the basic features of the experiment design in Example
14.3, and even though the optimal choices depend on unknown facts, it may be
sufficient for a successful application. The interplay between the model and the
criterion $C_{11}$, which depends on the regulator, which in turn depends on the model,
has led to a substantial literature on “identification for control.” This frequently
involves iterative approaches, where a sequence of experiments is performed,
interleaved with evaluations of the preliminary regulator designs. For an overview, see
Gevers (1993).


14.6 SUMMARY 


Preprocessing of data is an important prerequisite for the estimation phase. It may
involve “repair” of the data in terms of replacing missing or obviously wrong data
as well as merging disjoint data sets. It typically also involves data polishing by
removing undesired disturbance features in the data. This is accomplished primarily
by low-pass or high-pass prefiltering and/or subtracting offsets and trends from the
data. Notice that if the data are prefiltered, the noise model is affected. If a very
specific effect is desired with the prefiltering, it may therefore be wise not to let the
noise model be flexible.

Prefiltering also affects the distribution of bias over the frequency range,
together with other design variables. We showed in this chapter how, in some cases,
formal design criteria could be optimized with respect to the design variables. More
important, however, is that insights into the mechanisms that govern the bias
distribution allow thoughtful design that secures a good fit in important frequency
ranges, even when the optimality results are not directly applicable.


14.8 PROBLEMS

14G.1 Consider the design problem (14.23) with $\Phi_{ue}(\omega) \equiv 0$ and a given
noise model set

$$
\mathcal{H} = \{H(q,\eta)\}
$$

that is parametrized independently from the transfer function model set
$\mathcal{G}$. Show that the solution to (14.23) then is

$$
\frac{\Phi_u^{opt}(\omega)}{|H(e^{i\omega},\eta^*)|^2} = \alpha \cdot C_{11}(\omega)
$$

where $\eta^*$ denotes the resulting noise model parameter.
Hint: Establish first the following corollary to Lemma 14.1: Let $x^*(y)$ be defined
as in the lemma. Let $f(y)$ be an arbitrary function, and let

$$
y^*(z) = \arg\min_{y} V(x^*(f(y)), z)
$$

Then $f(y^*(z)) = z$ (reference: Yuan and Ljung, 1985).

14G.2 Consider the design problem (12.29) with a general matrix $C(\omega)$. Assume
that the noise model is fixed to $H_*$ and chosen a priori (thus it does not belong to
$\mathcal{D}$). The design variables consist of $\Phi_u$ and $\Phi_{ue}$. Show that the
solution to (12.29) is to select $\Phi_u$ and $\Phi_{ue}$ so that

$$
\frac{1}{|H_*(e^{i\omega})|^2}
\begin{bmatrix} \Phi_u(\omega) & \Phi_{ue}(\omega) \\ \Phi_{ue}(-\omega) & * \end{bmatrix}
= \alpha \begin{bmatrix} C_{11}(\omega) & C_{12}(\omega) \\ C_{21}(\omega) & * \end{bmatrix}
$$

Here $*$ means that the element in question does not affect the optimal design, and
$\alpha$ is an arbitrary positive scalar (reference: Ljung, 1986a).

14G.3 Consider the design problem (14.23) subject to the design variables

$$
\mathcal{D} = \{\Phi_u(\cdot), \Phi_{ue}(\cdot), H_*(q)\}
$$

Show that the optimal design is

$$
\Phi_{ue}^{opt}(\omega) \equiv 0, \qquad
\frac{\Phi_u^{opt}(\omega)}{|H_*^{opt}(e^{i\omega})|^2} = \alpha \cdot C_{11}(\omega)
$$

Hint: Use Problem 14G.2 (reference: Ljung, 1986a).
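14E.1 Consider the model (14.5) and introduce

$$\varphi^m(t) = [-y^m(t-1) \; \ldots \; -y^m(t-n_a) \;\; u^m(t-1) \; \ldots \; u^m(t-n_b) \;\; 1]^T$$
$$\theta = [a_1 \; \ldots \; a_{n_a} \;\; b_1 \; \ldots \; b_{n_b} \;\; \alpha]^T$$

Derive an expression for the LS estimate $\hat\theta_N$. Show that with the approximations

$$\frac{1}{N}\sum_{t=1}^{N} y^m(t-k) \approx \bar y, \quad \text{for all } 1 \le k \le n_a$$

$$\frac{1}{N}\sum_{t=1}^{N} u^m(t-k) \approx \bar u, \quad \text{for all } 1 \le k \le n_b$$

the estimates of $a_i$ and $b_i$ are identical to those obtained by subtracting sample means
as in (14.4) and (14.3) and then applying the LS method to (14.1).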


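14E.2 Prove the following matrix identities:
$$(I - Z)P_1(I - Z)^T + ZP_2Z^T = P_1 - P_1R^{-1}P_1 + (Z - P_1R^{-1})R(Z - P_1R^{-1})^T$$
$$P_1R^{-1} = (P_1^{-1} + P_2^{-1})^{-1}P_2^{-1}, \quad \text{where } R = (P_1 + P_2)$$

Use this to prove that if $\hat\theta_i$ are unbiased estimates of $\theta$ (i.e., $E\hat\theta_i = \theta$) with variances
$P_i$, then the variance of

$$\hat\theta = \alpha_1\hat\theta_1 + \alpha_2\hat\theta_2, \qquad E\hat\theta = \theta$$

is minimized for $\alpha_i = (P_1^{-1} + P_2^{-1})^{-1}P_i^{-1}$.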

CHOICE OF IDENTIFICATION CRITERION


The selection of an identification method is one of the important decisions to be taken 
in system identification. In Chapter 7, we have structured the abundant supply of 
candidates. Different aspects and results on various methods have been mentioned 
in earlier chapters. It is the purpose of the present chapter to sum up these and also 
discuss how different options affect the properties of the resulting estimates. 

A general discussion is given in Section 15.1. The particular problem of choosing
the norm $\ell(\cdot)$ for the prediction-error method (7.155) or the shaping function
$\alpha(\cdot)$ in (7.156) is studied in Section 15.2. Section 15.3 deals with optimal instruments
for the IV method and their approximate implementation. Section 15.4 summarizes
some basic advice to the user.


15.1 GENERAL ASPECTS 


We have described three basic approaches to identification in this book, each associated
with some design variables:

1. The prediction-error approach (7.12)
   • $\ell(\cdot)$: norm
   • $\mathcal{H}$: noise model set, including prefilter $L(q)$
2. The correlation approach (7.110)
   • $\alpha(\cdot)$: shaping function
   • $L(q)$: prefilter
   • $\zeta(t, \theta)$: correlation vector
3. The subspace approach to estimating state-space models (7.66), Section 10.6
   • $\varphi_s(t)$: correlation vector, corresponding to the regressors for which the
     k-step ahead predictors are determined
   • $r$: maximum prediction horizon
   • $W_1$, $W_2$: the weighting matrices in (10.127)
   • $R$: the “post-multiplication matrix” in (10.128)

The choice between the approaches and the choices of design variables within the
approaches are guided by a number of issues:


Applicability 


The prediction error approach has the advantage that it is applicable to all model
structures, linear and nonlinear, tailor-made and black-box-parameterized. It is also
valid for systems operating in open as well as closed loop. The minimization code is 
essentially the same, only the computation of the predictor and its gradient is specific 
for the model structure. 


The correlation approach is also, in principle, generally applicable, but it is most
naturally used only for the linear black-box family (4.33). Most of its use is really as 
an IV-method for the ARX-model. Closed-loop applications require special care in 
the choice of instruments. 


The subspace method is specifically designed for black-box linear systems in 
state-space form. Also for this method, it is necessary to use special solutions for 
closed-loop operation. 


Bias Considerations 


Will the method give unbiased estimates in case $\mathcal{S} \in \mathcal{M}$? All the listed methods for
all generic choices of design variables will give consistency in the case of open loop 
operation. The prediction error approach has the added advantage of guaranteeing 
consistency for systems operating in closed loop as well, in case the model structure 
(including noise model) contains the true system. 


If $\mathcal{S} \notin \mathcal{M}$, can the bias distribution be clearly understood and affected? Here
prediction-error methods have a clear advantage. Expressions (8.71) and related 
ones describe quite clearly in what sense the model approximates the true system in 
the linear case. For fixed noise models and open loop systems, it is easy to control 
the frequency emphasis by prefiltering. The approximation aspects of the correla- 
tion methods can be written down, but are less transparent. The exact nature of the 
approximation properties of subspace methods and how these are affected by the 
design variables are not yet fully understood. 


Variance and Robustness Considerations 


Optimizing (= minimizing) the variance is a simpler problem than optimizing bias.
The reason is that we have explicit expressions, from Chapter 9, for how the variance
is affected by the design variables in the prediction-error and correlation cases. For
the subspace method, it is currently not fully known how the design variables affect
the variance. Some partial results were quoted in Section 10.6.

We know from Chapter 9 that the theoretical Cramér-Rao lower bound is
asymptotically achievable by the maximum likelihood method, so in a sense we
know the answers beforehand to all variance optimization questions:

• Best choice of $\ell(x) = -\log f_e(x)$ ($f_e(\cdot)$ being the PDF of the true innovations)

• Best choice of noise model/prefilter = true noise description (possibly estimated)

• Best choice of IV method = make it equal to the preceding prediction error
method


Still, as we shall see, the MLE may not be the best approach in all cases. The reasons
may be the bias effects discussed previously, or that the estimates are sensitive to prior
knowledge that may be imprecise. We shall, in Section 15.2, look into robustness
issues for the choice of norm $\ell$ in the prediction error approach. In Section 15.3
we shall study variance-optimal instruments for the IV method, and see if optimal
accuracy can be obtained even without iterative search.


Ease of Computation 


The subspace methods have the important advantage that the algorithms do not 
contain iterative search. They can also be implemented using numerically robust 
algorithms. The IV-method has a similar advantage that it can estimate the dynam- 
ics of a linear system (but not the noise properties) without iterative search. The 
prediction error methods, except in the linear regression case, must rely upon itera- 
tive search methods, and may be trapped in false solutions that correspond to local 
minima. 


15.2 CHOICE OF NORM: ROBUSTNESS 


From (9.29) and (9.30) we know how the choice of $\ell(\cdot)$ in the prediction-error
approach affects the asymptotic variance, provided $\mathcal{S} \in \mathcal{M}$. The covariance matrix
(9.29) is scaled by the scalar

$$\kappa(\ell) = \frac{E\left\{[\ell'(e_0(t))]^2\right\}}{\left[E\,\ell''(e_0(t))\right]^2} \qquad (15.1)$$

According to Problem 9G.4, the shaping function $\alpha(x)$ in the correlation approach
scales the covariance matrix (9.78) by the same scalar (15.1) with $\ell'(x) = \alpha(x)$.
Hence the choices of $\ell$ and $\alpha$ can be discussed simultaneously. The scalar $\kappa$ depends
only on the function $\ell(x)$ and on the distribution of the true innovations $e_0(t)$. Let
the PDF of these be denoted by $f_e(x)$. Then

$$\kappa(\ell) = \kappa(\ell, f_e) = \frac{\int [\ell'(x)]^2 f_e(x)\,dx}{\left[\int \ell''(x) f_e(x)\,dx\right]^2} \qquad (15.2)$$

Here prime and double prime denote differentiation with respect to the argument x.

Optimal Norm


If we focus on variance aspects, our objective is to select $\ell$ so that $\kappa(\ell, f_e)$ is
minimized. Notice that this problem, as such, is quite independent of the underlying
system identification problem, the particular model structure used, and so on. We
have the following result.


Lemma 15.1.
$$\kappa(\ell, f_e) \ge \kappa(-\log f_e, f_e), \quad \forall \ell \qquad (15.3)$$


Proof. We have by partial integration

$$\int \ell''(x) f_e(x)\,dx = -\int \ell'(x) f_e'(x)\,dx = -\int \ell'(x)\,\frac{f_e'(x)}{f_e(x)}\, f_e(x)\,dx$$

Cauchy's inequality now gives

$$\left[\int \ell''(x) f_e(x)\,dx\right]^2 \le \int [\ell'(x)]^2 f_e(x)\,dx \cdot \int \left[\frac{f_e'(x)}{f_e(x)}\right]^2 f_e(x)\,dx$$

with equality when

$$\ell'(x) = C_1 \cdot \frac{f_e'(x)}{f_e(x)}$$

which proves that

$$\ell(x) = C_1 \log f_e(x) + C_2$$

gives the minimum in (15.3). □


The lemma tells us that the best choice is

$$\ell_{\rm opt}(\varepsilon) = -\log f_e(\varepsilon) \qquad (15.4a)$$

which can be seen as a restatement of the fact that the maximum likelihood method
is asymptotically efficient. The lemma deals with the case of a stationary innovations
sequence $\{e_0(t)\}$. If the distribution of $e_0(t)$ depends on $t$, $f_e(x, t)$, then the optimal
norm is also time varying:

$$\ell_{\rm opt}(\varepsilon, t) = -\log f_e(\varepsilon, t) \qquad (15.4b)$$

This follows from the fact that (15.4b) gives the MLE. It can also be established
directly; see Problem 15T.2. If the innovations are Gaussian, with known variances,
then (15.4b) tells us to use a quadratic norm, scaled by the inverse innovation
variances.
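
As a worked instance of (15.4b), assume zero-mean Gaussian innovations whose variance at time $t$ is written $\lambda_t$ (this symbol is introduced here only for illustration). Taking the negative logarithm of the density gives

```latex
f_e(\varepsilon, t) = \frac{1}{\sqrt{2\pi\lambda_t}}
      \exp\left(-\frac{\varepsilon^2}{2\lambda_t}\right)
\quad\Longrightarrow\quad
\ell_{\rm opt}(\varepsilon, t) = -\log f_e(\varepsilon, t)
      = \frac{\varepsilon^2}{2\lambda_t} + \frac{1}{2}\log(2\pi\lambda_t)
```

The second term does not depend on the prediction error, so when the variances are known it plays no role in the minimization, and what remains is the quadratic norm weighted by $1/\lambda_t$.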

An annoying aspect of these results is that the PDF $f_e$ may not be known.
There are two remedies for this: to simultaneously estimate $f_e$ or to select an $\ell$ that
is insensitive to the different $f_e$ that possibly could be at hand.


Adapting the Norm 


In the first case we include additional parameters $\alpha$: $\ell(\varepsilon, \alpha)$ so as to allow for an
adjustment of the norm. Then we know from (8.59) that the norm will adjust itself so
that it becomes close to the optimal choice (15.4). If a reasonably good estimate of
$f_e$ in (15.4) will do, the adaptive norm approach will be a good solution. This might,
however, not be the case, as we shall now demonstrate.


Sensitivity of the Optimal Norm 


The optimal variance scaling $\kappa(\ell, f)$ in (15.2) could be quite sensitive with respect
to the PDF $f$. That is, as a function of $f$, the scalar $\kappa(-\log f_e, f)$ could have a very
sharp minimum at $f = f_e$. This is illustrated in the following example.


Example 15.1 Sensitivity of Optimal Norm 


Let the nominal PDF $f_e$ be a normal with variance 1:

$$f_e(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} = \varphi(x)$$

Then $-\log f_e(x) = \frac{1}{2}x^2$ (disregarding a constant term) and

$$\kappa(-\log f_e, f_e) = \frac{\int x^2 f_e(x)\,dx}{\left[\int f_e(x)\,dx\right]^2} = 1 \qquad (15.5)$$

Suppose now that the prediction errors with a very small probability can assume a
certain large value. This could, for example, correspond to a certain failure in the
measurement or data transmission equipment. Such data are called outliers. We
thus assume that $\varepsilon$ is almost normal, but with probability $\frac{1}{2}10^{-3}$ it may assume the
value 100, and with probability $\frac{1}{2}10^{-3}$ the value $-100$. The actual $f$ then is

$$f(x) = (1 - 10^{-3})\varphi(x) + 10^{-3}\left[\tfrac{1}{2}\delta(x - 100) + \tfrac{1}{2}\delta(x + 100)\right] \qquad (15.6)$$

This gives

$$\kappa(-\log f_e, f) = (1 - 10^{-3}) + 10^{4} \cdot 10^{-3} = 10.999 \qquad (15.7)$$

The variance thus becomes 11 times larger, even though the change of probabilities
in absolute terms was very small. □
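
The arithmetic in (15.5) and (15.7) can be checked with a few lines of Python. This is a minimal sketch under the assumptions of the example (quadratic norm, unit-variance normal core, symmetric point masses at $\pm a$); the function name `kappa_quadratic` and its arguments are hypothetical and not taken from the text.

```python
def kappa_quadratic(eps_out, a):
    """kappa(l, f) from (15.2) for l(x) = x^2 / 2 and the contaminated PDF
    f = (1 - eps_out) * N(0, 1) + eps_out * [delta(x - a) / 2 + delta(x + a) / 2]."""
    # l'(x) = x, so the numerator is the second moment of f.
    second_moment = (1.0 - eps_out) * 1.0 + eps_out * a ** 2
    # l''(x) = 1, so the denominator is (integral of f)^2 = 1.
    return second_moment / 1.0 ** 2

kappa_nominal = kappa_quadratic(0.0, 100.0)        # no outliers, cf. (15.5)
kappa_contaminated = kappa_quadratic(1e-3, 100.0)  # outliers as in (15.6), cf. (15.7)
print(kappa_nominal, kappa_contaminated)
```

Varying `eps_out` or `a` shows how quickly the variance scaling of the quadratic norm grows with the size of the outliers, since `a` enters squared.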


Robust Norms 


It is obvious that such a sensitivity to the true PDF $f_e$ is not acceptable in practical
use. Adapting the norm is not the solution in most cases, since a finite number of data
may not render an accurate enough estimate of the best norm. Instead we must look
for norms that are robust with respect to unknown variations in the PDF. This is a
well-developed topic in statistics; see, for example, the monograph by Huber (1981).
A useful formalization is to seek the norm $\ell$ that minimizes the largest variance
scaling that may result in a certain class $\mathcal{F}$ of PDFs:

$$\ell_{\rm opt} = \arg\min_{\ell} \max_{f \in \mathcal{F}} \kappa(\ell, f) \qquad (15.8)$$

This norm gives, in Huber's terminology, the minimax M-estimate. The problem
(15.8) with $\kappa(\ell, f)$ given by (15.2) is a variational problem whose solution depends
only on the family $\mathcal{F}$ of functions. It is thus decoupled from its statistical context, and
the discussion of (15.8) in, for example, Chapter 4 in Huber (1981) applies equally
well to our system identification framework.

Typical families $\mathcal{F}$ are environments of the normal distribution, much in the
spirit of our example. Solutions to such problems have the characteristic feature
that $\ell'(x)$ behaves like x for small x, then saturates, and may even tend to zero as x
increases (“redescending” $\ell'$). Some typical curves are shown in Figure 15.1.

Let us return to our example to check how such $\ell$ may handle outliers.


Figure 15.1 Some typical robust choices of $\ell'(x)$.


Example 15.1 (continued) 
Let $\ell_*(x)$ be such that

$$\ell_*'(x) = \begin{cases} x, & |x| < 4 \\ 4, & x \ge 4 \\ -4, & x \le -4 \end{cases}$$

Then with $f$ as in (15.6)

$$\kappa(\ell_*, f) = \frac{0.999\int_{|x|<4} x^2\varphi(x)\,dx + 0.999\int_{|x|\ge 4} 16\,\varphi(x)\,dx + 0.001 \cdot 16}{\left[0.999\int_{|x|<4}\varphi(x)\,dx\right]^2} = 1.015$$

The decrease in variance compared to (15.7) is drastic. Let us also check what we
lose in optimality if the true PDF indeed was normal:

$$\kappa(\ell_*, f_e) = \frac{\int_{|x|<4} x^2\varphi(x)\,dx + \int_{|x|\ge 4} 16\,\varphi(x)\,dx}{\left[\int_{|x|<4}\varphi(x)\,dx\right]^2} = 1.0001$$

This price of increased variance for the nominal case is thus worth paying to obtain
resilience against small variations in the PDF. □
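
The integrals in the continued example have closed forms in terms of the normal density and tail probability, so the variance scaling of the clipped norm can be evaluated directly. The sketch below assumes the same contaminated PDF as (15.6); `kappa_clipped` and its parameters are hypothetical names, and the clipping level `c` is left free so that other saturation points can be tried.

```python
import math

def kappa_clipped(c, eps_out, a):
    """kappa(l, f) from (15.2) when l'(x) is x clipped to [-c, c] and
    f = (1 - eps_out) * N(0, 1) + eps_out * [delta(x - a) / 2 + delta(x + a) / 2],
    with the point masses outside the linear zone (a > c)."""
    phi_c = math.exp(-0.5 * c * c) / math.sqrt(2.0 * math.pi)  # normal density at c
    tail = 0.5 * math.erfc(c / math.sqrt(2.0))                 # P(x > c) under N(0, 1)
    inner_x2 = 1.0 - 2.0 * (c * phi_c + tail)   # integral of x^2 * phi over |x| < c
    outer_c2 = c * c * 2.0 * tail               # integral of c^2 * phi over |x| >= c
    num = (1.0 - eps_out) * (inner_x2 + outer_c2) + eps_out * c * c
    # l''(x) = 1 for |x| < c and 0 elsewhere, so only the normal core contributes.
    den = ((1.0 - eps_out) * (1.0 - 2.0 * tail)) ** 2
    return num / den

kappa_nominal = kappa_clipped(4.0, 0.0, 100.0)        # true PDF is normal
kappa_contaminated = kappa_clipped(4.0, 1e-3, 100.0)  # outliers as in (15.6)
print(kappa_nominal, kappa_contaminated)
```

Both values come out only slightly above 1, in line with the conclusion of the example: the clipped norm costs almost nothing in the nominal case and removes nearly all of the variance inflation caused by the outliers.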


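As a recommended choice of robust norm, we may suggest the following; see
Ruppert (1985) and Hampel et al. (1986), p. 105.

$$\ell'(x) = \begin{cases} x, & |x| \le \rho\hat\sigma \\ \rho\hat\sigma, & x > \rho\hat\sigma \\ -\rho\hat\sigma, & x < -\rho\hat\sigma \end{cases} \qquad (15.9)$$

Here $\hat\sigma$ is the estimated standard deviation of the prediction errors, while $\rho$ is a
scalar in the range $1 \le \rho \le 1.8$. The estimate $\hat\sigma$ in turn should be robust so that it
is not disturbed by outliers. A recommended estimate is to take

$$\hat\sigma = \frac{\text{MAD}}{0.7} \qquad (15.10)$$

Here MAD = the median of $\{|\varepsilon(t) - \bar\varepsilon|\}$ with $\bar\varepsilon$ as the median of $\{\varepsilon(t)\}$.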
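
A minimal sketch of (15.9) and (15.10) using only the Python standard library. The function names and the sample residuals are hypothetical, chosen for illustration; the value of `rho` is left to the caller.

```python
import statistics

def mad_sigma(residuals):
    """Robust scale estimate (15.10): median absolute deviation divided by 0.7."""
    med = statistics.median(residuals)
    mad = statistics.median(abs(e - med) for e in residuals)
    return mad / 0.7

def robust_norm_derivative(e, rho, sigma):
    """l'(e) of (15.9): linear inside [-rho*sigma, rho*sigma], saturated outside."""
    bound = rho * sigma
    return max(-bound, min(bound, e))

residuals = [-1.0, 0.0, 1.0, 100.0]   # made-up data with one gross outlier
sigma_hat = mad_sigma(residuals)
clipped = [robust_norm_derivative(e, 1.5, sigma_hat) for e in residuals]
print(sigma_hat, clipped)
```

Because both medians ignore the magnitude of the largest sample, the scale estimate stays of the order of the typical residuals, and the outlier's contribution to the gradient is capped at `rho * sigma_hat`.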


Influence Function 


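A basic idea behind robust norms is to limit the influence of single observations on
the resulting estimate. It is reasonable to ask how the estimate would change had
a certain observation been lacking. For the least-squares estimate, an exact answer
can be given. It follows from (11.9) that

$$\hat\theta_N = \hat\theta_N^{t} + R^{-1}(N)\varphi(t)\left[y(t) - (\hat\theta_N^{t})^T\varphi(t)\right]
             = \hat\theta_N^{t} + R_t^{-1}(N)\varphi(t)\left[y(t) - \hat\theta_N^T\varphi(t)\right] \qquad (15.11)$$

where $\hat\theta_N$ is the actual estimate and $\hat\theta_N^{t}$ is the estimate with the measurement
$(y(t), \varphi(t))$ removed. Also,

$$R(N) = \sum_{k=1}^{N}\varphi(k)\varphi^T(k), \qquad R_t(N) = R(N) - \varphi(t)\varphi^T(t)$$

The influence of measurement $(y(t), \varphi(t))$ can thus be evaluated by

$$R_t^{-1}(N)\varphi(t)\varepsilon(t, \hat\theta_N)$$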




This is, somewhat simplified, the idea behind Hampel's (1974) influence function.
Generalized to arbitrary norms and model structures, the influence of measurement
t can approximately be evaluated by

$$S(t) = R_t^{-1}(N)\psi(t, \hat\theta_N)\ell'(\varepsilon(t, \hat\theta_N))$$
$$R_t(N) = \sum_{k=1,\ k \ne t}^{N} \psi(k, \hat\theta_N)\ell''(\varepsilon(k, \hat\theta_N))\psi^T(k, \hat\theta_N) \qquad (15.12)$$

Compare (11.52). The objective with robust norms $\ell$ as in Figure 15.1 could then be
expressed as minimizing

$$\max_t |S(t)|$$

More pragmatically, it makes good sense to critically evaluate $S(t)$ in (15.12) after
the parameter fit has been completed so as to reveal which observations have
considerably influenced the estimate. Such observations had better be reliable or their
influence should be reduced.
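
For the least-squares case, the influence of each measurement can be tabulated without refitting. The sketch below assumes a scalar regression $y(t) = \theta\varphi(t) + e(t)$ so that $R(N)$ is a scalar; the function names and the data are hypothetical and serve only to illustrate (15.11).

```python
def ls_scalar(phi, y):
    """LS estimate and R(N) for the scalar regression y(t) = theta * phi(t) + e(t)."""
    R = sum(p * p for p in phi)
    return sum(p * v for p, v in zip(phi, y)) / R, R

def influence(phi, y):
    """theta_N minus the leave-one-out estimate, for every t, using (15.11)."""
    theta, R = ls_scalar(phi, y)
    out = []
    for p, v in zip(phi, y):
        R_t = R - p * p                         # R_t(N) = R(N) - phi(t) phi(t)^T
        out.append(p * (v - theta * p) / R_t)   # R_t^{-1} phi(t) eps(t, theta_N)
    return theta, out

phi = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 14.0]          # the last sample is a made-up outlier
theta, infl = influence(phi, y)
worst = max(range(len(infl)), key=lambda t: abs(infl[t]))
print(theta, infl, worst)
```

Ranking the samples by the magnitude of their influence, as in the last lines, is one simple way to single out the observations that deserve a closer look after the fit.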


Detecting Outliers 


Outliers that are as drastic as in Example 15.1 and data points that have a large
influence on the estimate will often be detected by eye inspection of the data record.
It is good practice, even when robust norms are used, to display the data before they
are used for identification. Outliers are most easily detected in plots of the residuals
$\varepsilon(t, \hat\theta_N)$. See Example 14.1 and Section 16.5.


Multivariable Case (*)


The covariance matrix for multivariable systems is given by (9.47). Choices of the
function $\ell(\varepsilon)$ from $\mathbf{R}^p$ to $\mathbf{R}$ are quite analogous to the scalar case discussed
previously. The multivariable case introduces a new issue, though: How should the
different components of the vector $\varepsilon$ be weighted together? We shall illustrate this
question by considering the family (7.28) and (7.29) of quadratic criteria.

The covariance matrix associated with the quadratic norm $\Lambda^{-1}$ in (7.27) is,
according to (9.47) (see also Problem 9E.4),

$$P_\theta(\Lambda) = \left[E\psi(t, \theta_0)\Lambda^{-1}\psi^T(t, \theta_0)\right]^{-1}\left[E\psi(t, \theta_0)\Lambda^{-1}\Lambda_0\Lambda^{-1}\psi^T(t, \theta_0)\right]
\times \left[E\psi(t, \theta_0)\Lambda^{-1}\psi^T(t, \theta_0)\right]^{-1}$$

Here $\Lambda_0 = Ee_0(t)e_0^T(t)$ is the true innovations covariance matrix. It is
straightforward to establish (cf. Problem 9E.4) that

$$P_\theta(\Lambda) \ge P_\theta(\Lambda_0), \quad \forall \Lambda$$

so that the best norm requires knowledge of the true covariance. Since anyway the
minimization of (7.29) is carried out iteratively, the following variant to implement
$\Lambda = \Lambda_0$ suggests itself:

$$\hat\Lambda_N^{(i)} = \frac{1}{N}\sum_{t=1}^{N}\varepsilon(t, \hat\theta_N^{(i)})\varepsilon^T(t, \hat\theta_N^{(i)})$$
$$\hat\theta_N^{(i+1)} = \arg\min_\theta \frac{1}{N}\sum_{t=1}^{N}\varepsilon^T(t, \theta)\left[\hat\Lambda_N^{(i)}\right]^{-1}\varepsilon(t, \theta) \qquad (15.13)$$

where superscript (i) denotes the ith iterate.
It is interesting to note, though, that the criterion

$$\hat\theta_N = \arg\min_\theta \det\left[\frac{1}{N}\sum_{t=1}^{N}\varepsilon(t, \theta)\varepsilon^T(t, \theta)\right] \qquad (15.14)$$

gives the same asymptotic covariance matrix for $\hat\theta_N$ as the quadratic norm (7.27),
with $\Lambda$ being the true innovations covariance. See Problem 9E.5 and also (7.92) to
(7.96).


15.3 VARIANCE-OPTIMAL INSTRUMENTS 


Consider now the IV method (7.129). Under assumptions (9.80) and (9.81) the
asymptotic covariance matrix $P_\theta$ of the estimates is given by (9.83). Clearly, the
choice of instruments $\zeta(t, \theta)$ and the choice of prefilter $L(q)$ may have a considerable
effect on $P_\theta$. Now, what might the optimal choices be?


A Lower Bound 


Suppose that the true system is given by

$$y(t) = G_0(q)u(t) + H_0(q)e_0(t) \qquad (15.15)$$

and that the transfer function $G_0$ is to be estimated, while $H_0$ is assumed known.
Let

$$G_0(q) = \frac{B_0(q)}{A_0(q)} \qquad (15.16)$$

and let the model be parametrized as

$$G(q, \theta) = \frac{B(q)}{A(q)} \qquad (15.17)$$

with appropriate model orders. The limit of the Cramér-Rao bound for this estimation
problem is given by (9.31) (normality assumed here):

$$P_{CR} = \lambda_0\left[E\psi(t, \theta_0)\psi^T(t, \theta_0)\right]^{-1} \qquad (15.18)$$

$$\psi(t, \theta_0) = \frac{d}{d\theta}\hat y(t|\theta)\Big|_{\theta=\theta_0}
= \frac{1}{H_0(q)A_0(q)}\left[-G_0(q)u(t-1) \; \ldots \; -G_0(q)u(t-n_a) \;\; u(t-1) \; \ldots \; u(t-n_b)\right]^T \qquad (15.19)$$

[cf. (10.55); here we have used A for what is called F in the model (4.33)].


Optimal Instruments 


Equation (15.18) gives a lower bound for any unbiased method aiming at estimating
$\theta$ in (15.17). It thus applies also to the IV method, so (15.18) is a lower bound for
(9.83). However, this lower bound is achieved for

$$\zeta^{\rm opt}(t) = \psi(t, \theta_0), \qquad L^{\rm opt}(q) = \frac{1}{H_0(q)A_0(q)} \qquad (15.20)$$

To see this, we note that

$$\varphi_F(t) = L^{\rm opt}(q)\varphi(t) = \psi(t, \theta_0) + \tilde\varphi(t)$$

where $\tilde\varphi(t)$ depends on $\{e_0(t)\}$ only, while $\psi(t, \theta_0)$ depends on $\{u(t)\}$ only. Hence

$$E\zeta^{\rm opt}(t)\varphi_F^T(t) = E\psi(t, \theta_0)\psi^T(t, \theta_0)$$

and (9.83) simplifies to (15.18), when the system operates in open loop. The optimal
design variables for the IV method are thus given by (15.20).


Adaptive IV Methods 


While (15.20) gives direct advice about the best IV method, the annoying aspect
is that the optimal prefilter and instruments depend on unknown properties of the
true system. This could be handled by letting the instruments $\zeta(t, \theta)$ depend on $\theta$
in a proper way and by simultaneously estimating the noise properties. This leads
to algorithms that are closely related to and of about the same complexity as the
corresponding prediction-error method. An alternative is to approximately realize
the optimal choices in a multistep algorithm.


Multistep Algorithm 


The choices of instruments and prefilters in the IV method primarily affect the
asymptotic variance, while the consistency properties are generically secured. This suggests
that minor deviations from the optimal values according to (15.20) will only cause
second-order effects in the resulting accuracy. It could thus be sufficient to use
consistent, but not necessarily efficient, estimates of the dynamics and of the noise when
forming the instruments and prefilter. To retain the appealing simplicity of the IV
method, we should work with linear regression structures for these steps. We thus
suggest the following four-step IV estimator for a system that operates in open loop.


Step 1: Write the model structure (15.17) as a linear regression

$$\hat y(t|\theta) = \varphi^T(t)\theta \qquad (15.21)$$

Estimate $\theta$ by the LS method (7.34). Denote the estimate by $\hat\theta_N^{(1)}$ and the
corresponding transfer function by $\hat G_N^{(1)}(q)$.


Step 2: Generate the instruments as in (7.122) and (7.123):

$$x^{(1)}(t) = \hat G_N^{(1)}(q)u(t) \qquad (15.22)$$

$$\zeta^{(1)}(t) = \left[-x^{(1)}(t-1) \; \ldots \; -x^{(1)}(t-n_a) \;\; u(t-1) \; \ldots \; u(t-n_b)\right]^T \qquad (15.23)$$

and determine the IV estimate (7.118) of $\theta$ in (15.21) using these instruments. Denote
the estimate $\hat\theta_N^{(2)}$ and the corresponding transfer-function estimate

$$\hat G_N^{(2)}(q) = \frac{\hat B_N^{(2)}(q)}{\hat A_N^{(2)}(q)}$$


Step 3: Let

$$\hat w_N^{(2)}(t) = \hat A_N^{(2)}(q)y(t) - \hat B_N^{(2)}(q)u(t)$$

and postulate an AR model of order $n_a + n_b$ (order chosen to balance the
computational efforts in each step) for $\hat w_N^{(2)}(t)$:

$$L(q)\hat w_N^{(2)}(t) = e(t)$$

Estimate $L(q)$ using the LS method and denote the result by $\hat L_N(q)$.


Step 4: Let $x^{(2)}(t)$ be defined analogously to (15.22), and let

$$\zeta^{(2)}(t) = \hat L_N(q)\left[-x^{(2)}(t-1) \; \ldots \; -x^{(2)}(t-n_a) \;\; u(t-1) \; \ldots \; u(t-n_b)\right]^T \qquad (15.24)$$

Using these instruments and a prefilter $\hat L_N(q)$ in (7.129) with $\alpha(x) = x$, determine
the IV estimate of $\theta$ in (15.21), giving the final estimate

$$\hat\theta_N = \left[\sum_{t=1}^{N}\zeta^{(2)}(t)\varphi_F^T(t)\right]^{-1}\sum_{t=1}^{N}\zeta^{(2)}(t)y_F(t)$$
$$\varphi_F(t) = \hat L_N(q)\varphi(t), \qquad y_F(t) = \hat L_N(q)y(t) \qquad (15.25)$$

This algorithm is a special case of a multistep procedure discussed in Section 6.4
of Söderström and Stoica (1983). They show that the asymptotic covariance matrix
of $\hat\theta_N$ indeed is the Cramér-Rao bound (15.18), provided (in our case) the true $H_0$
is an autoregression of order $n_b$.
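
The four steps (15.21) to (15.25) can be sketched in plain Python as follows. This is an illustrative sketch, not the book's implementation: all function names are hypothetical, signals before $t = 0$ are taken as zero, the linear equations are solved by a small Gaussian elimination routine, and the simulated system at the end is a made-up second-order example with white output noise rather than the system of Example 15.2.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def iv_estimate(Z, Phi, Y):
    """Solve [sum z(t) phi(t)^T] theta = sum z(t) y(t); this is LS when Z is Phi."""
    d = len(Phi[0])
    A = [[sum(z[i] * p[j] for z, p in zip(Z, Phi)) for j in range(d)] for i in range(d)]
    b = [sum(z[i] * v for z, v in zip(Z, Y)) for i in range(d)]
    return solve(A, b)

def fir(c, x):
    """Apply c(q) = c[0] + c[1] q^-1 + ... to x, assuming zero data before t = 0."""
    return [sum(ck * x[t - k] for k, ck in enumerate(c) if t >= k) for t in range(len(x))]

def simulate(a, b, u):
    """Noise-free output x(t) = -sum a_k x(t-k) + sum b_k u(t-k) of the model B/A."""
    x = []
    for t in range(len(u)):
        v = sum(bk * u[t - k] for k, bk in enumerate(b, 1) if t >= k)
        v -= sum(ak * x[t - k] for k, ak in enumerate(a, 1) if t >= k)
        x.append(v)
    return x

def regressors(s, u, na, nb):
    """Rows [-s(t-1) ... -s(t-na)  u(t-1) ... u(t-nb)], zeros before t = 0."""
    return [[-s[t - k] if t >= k else 0.0 for k in range(1, na + 1)]
            + [u[t - k] if t >= k else 0.0 for k in range(1, nb + 1)]
            for t in range(len(s))]

def four_step_iv(y, u, na, nb):
    Phi = regressors(y, u, na, nb)
    th1 = iv_estimate(Phi, Phi, y)                               # step 1: LS on (15.21)
    Z1 = regressors(simulate(th1[:na], th1[na:], u), u, na, nb)  # step 2: (15.22)-(15.23)
    th2 = iv_estimate(Z1, Phi, y)
    a2, b2 = th2[:na], th2[na:]
    Ay, Bu = fir([1.0] + a2, y), fir([0.0] + b2, u)              # step 3: w = A y - B u
    w = [p - r for p, r in zip(Ay, Bu)]
    W = regressors(w, w, na + nb, 0)                             # AR regressors -w(t-k)
    L = [1.0] + iv_estimate(W, W, w)                             # LS fit of L(q) w = e
    uF, yF = fir(L, u), fir(L, y)                                # step 4: (15.24)-(15.25)
    Z2 = regressors(fir(L, simulate(a2, b2, u)), uF, na, nb)
    return iv_estimate(Z2, regressors(yF, uF, na, nb), yF)

random.seed(1)
u = [random.choice((-1.0, 1.0)) for _ in range(2000)]
y = [x + random.gauss(0.0, 1.0) for x in simulate([-1.5, 0.7], [1.0, 0.5], u)]
theta = four_step_iv(y, u, 2, 2)   # estimates of [a1, a2, b1, b2]
print(theta)
```

Every step is a linear least-squares or IV solve, so no iterative search is needed. Filtering the regressor vector by $\hat L_N(q)$ is done here by filtering the underlying signals first, which is equivalent under the zero-initial-condition assumption.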


Example 15.2 Four-Step IV Algorithm


The system 
1.0q~? + 0.5973 
(H) = —— ult) + elt 
"D= eaa g 1 
was simulated over 400 samples with {e(t)} white Gaussian noise of variance 1 and
{u(t)} as a white binary ±1 signal. A second-order ARX-model structure with
two delays was used. The four-step IV method gave the following transfer-function
estimates:

$$\hat G_N^{(1)}(q) = \frac{1.0597q^{-2} + 1.1546q^{-3}}{1 - 1.0255q^{-1} + 0.2965q^{-2}}$$

$$\hat G_N^{(2)}(q) = \frac{0.9778q^{-2} + 0.2750q^{-3}}{1 - 1.6072q^{-1} + 0.7803q^{-2}}$$

$$\hat G_N(q) = \frac{0.9688q^{-2} + 0.5216q^{-3}}{1 - 1.5038q^{-1} + 0.7023q^{-2}}$$

□

The example indicates that the extra work of steps 3 and 4 is worthwhile. An
additional advantage is that these steps provide a noise characteristics estimate,
necessary for the computation of the asymptotic covariance (9.83). In fact, as remarked
previously, the particular choices of $\zeta$ and $L$ give the following estimated covariance
matrix of $\hat\theta_N$:

$$\hat P_N = \hat\lambda_N\left[\sum_{t=1}^{N}\zeta^{(2)}(t)\left[\zeta^{(2)}(t)\right]^T\right]^{-1} \qquad (15.26a)$$

$$\hat\lambda_N = \frac{1}{N}\sum_{t=1}^{N}\left[y_F(t) - \varphi_F^T(t)\hat\theta_N\right]^2 \qquad (15.26b)$$


15.4 SUMMARY 


It is an unavoidable consequence of the structure of this book that useful results 
and advice on various identification methods are scattered over several chapters. 
Therefore, we give here a concrete user-oriented summary of suggested parametric 
identification procedures. 

We consider prediction-error methods (PEM) to be the basic approach to system
identification. They have three important advantages:

1. Applicability to general model structures


2. Optimal asymptotic accuracy when the true system can be represented within
the model structure


3. Reasonable approximation properties when the true system cannot be represented
within the model structure


For a given model structure $\hat y(t|\theta)$, the PEM can be summarized as follows:
Select a prefilter $L(q)$, guided by the discussion of Section 14.4. Form the criterion

$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N}\ell(\varepsilon_F(t, \theta))$$
$$\varepsilon_F(t, \theta) = L(q)\left[y(t) - \hat y(t|\theta)\right]$$
$$\ell(\cdot) \text{ given by (15.9)}$$

For a linear black-box model, form an initial estimate $\hat\theta_N^{(0)}$ by the procedure
(10.79). Then minimize $V_N$ iteratively using the damped Gauss-Newton method
(10.40), (10.41), and (10.46) [(10.47) when necessary]. The asymptotic properties of
the resulting estimate $\hat\theta_N$ are then given by (8.103) and (9.90).
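
Equation (15.9) specifies the derivative $\ell'$; one norm with that derivative is the piecewise quadratic-linear function often called Huber's loss. The sketch below evaluates the criterion $V_N$ for a given sequence of filtered prediction errors under that reading. The function names are hypothetical, and computing the errors themselves (the predictor and the prefilter) is left outside the sketch.

```python
def huber_norm(eps, rho, sigma):
    """l(eps) whose derivative is eps clipped to [-rho*sigma, rho*sigma], cf. (15.9)."""
    bound = rho * sigma
    if abs(eps) <= bound:
        return 0.5 * eps * eps
    return bound * abs(eps) - 0.5 * bound * bound   # linear growth outside the bound

def criterion(filtered_errors, rho, sigma):
    """V_N = (1/N) * sum of l(eps_F(t)) over the data record."""
    return sum(huber_norm(e, rho, sigma) for e in filtered_errors) / len(filtered_errors)

print(criterion([0.5, 3.0], 1.0, 1.0))
```

The two branches agree in value and slope at $|\varepsilon| = \rho\hat\sigma$, so the criterion stays continuously differentiable, which is what a Gauss-Newton type search requires.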

However, it is still true that other methods may be preferable in certain cases.
Especially for a linear system with several outputs that requires a model structure
with many parameters, the subspace methods form a valuable alternative. They
have the advantage of allowing an estimate, using efficient and numerically robust
calculations without iterative search.

The main advantage of the IV method is its simplicity. It is often worthwhile
to use the four-step procedure (15.21)-(15.26), as well as the subspace method, for
a first quick estimate of the system transfer function. This may then be refined by
PEM, if necessary.

We may note that it is not necessary to be able to tell which of the approaches
is “best.” Experience says that each may have its advantages. It is good practice
to have them all in one’s toolbox, compute models with the different methods, and
subject them to validation according to the next chapter.


15.5 BIBLIOGRAPHY 


Robust norms have been extensively discussed in the statistical literature. Huber
(1981) gives a comprehensive account of robust estimates. Hampel et al. (1986)
discuss the use of the influence function. Krasker and Welsch (1982) describe how
to “robustify” linear regressions also with respect to the magnitude of the regressors.
Polyak and Tsypkin (1980) have advocated the use of robust norms for on-line
applications. Tsypkin (1984) contains a comprehensive treatment of this application.

Optimal instruments and their approximate implementation have been extensively
studied by Söderström and Stoica (1981, 1983) and Stoica and Söderström
(1983). Several mixed IV-PEM schemes have been studied also by Young and
Jakeman (1979).

15.6 PROBLEMS


15G.1 Consider an identification criterion

$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N}\alpha_t\,\ell(\varepsilon(t, \theta), t)$$

where the functions $\ell(\cdot, t)$ are given and the positive scalars $\alpha_t$ are to be selected.
Suppose that $\mathcal{S} \in \mathcal{M}$ and that the true innovations $\{e_0(t)\}$ are white, zero mean, but
not necessarily stationary. Show that the choice of $\{\alpha_t\}$ that minimizes the variance of
the parameter estimate is

$$\alpha_t^{\rm opt} = \left\{\frac{E[\ell'(e_0(t), t)]^2}{E\,\ell''(e_0(t), t)}\right\}^{-1} \quad \text{(times arbitrary scaling)}$$

Conclude that, if $\ell(x, t) = x^2$ for all $t$, then the optimal weights $\alpha_t$ are the inverse
variances of the innovations $e_0(t)$, regardless of their distribution. [Compare (11.63).]
Conclude also that, if $e_0(t)$ is Gaussian with variance $\lambda_t$ and $\ell(x) = |x|$, then
$\alpha_t^{\rm opt} = 1/\sqrt{\lambda_t}$.


15E.1 Consider the output error model structure

$$\hat y(t|\theta) = G(q, \theta)u(t)$$

Suppose that the true system is given by

$$y(t) = G_0(q)u(t) + H_0(q)e_0(t)$$

where $e_0(t)$ is white noise. Suppose also that $G_0(q) = G(q, \theta_0)$. Apply a prediction-error
method with prefilter $L(q)$ to estimate $\theta$ and compute the variance of $\hat\theta_N$. Show
that the variance is minimized for $L(q) = H_0^{-1}(q)$.


15T.1 Prove that (15.18) gives a lower bound for (9.83) by direct algebraic methods
(reference: Söderström and Stoica, 1983, p. 97).


15T.2 In case a time-varying norm is allowed, (15.1) takes the form

$$\kappa(\ell(\cdot, \cdot)) = \frac{\bar E[\ell'(e_0(t), t)]^2}{\left[\bar E\,\ell''(e_0(t), t)\right]^2} \qquad (15.27)$$

(see Problem 9T.1). Recall that

$$\bar E g(e(t), t) = \lim_{N \to \infty}\frac{1}{N}\sum_{t=1}^{N} E g(e(t), t)$$

Establish that (15.4b) minimizes (15.27) by applying Schwarz’s inequality to the
expression corresponding to (15.27) with finite N values replacing the limit $\bar E$.


15T.3 Suppose that the innovations have a time-varying distribution, but that the norm $\ell(\varepsilon)$
is constrained to be time invariant. Show that (15.1) is then minimized by

$$\ell(\varepsilon) = -\log\left(\lim_{N \to \infty}\frac{1}{N}\sum_{t=1}^{N} f_e(\varepsilon, t)\right)$$

MODEL STRUCTURE SELECTION AND MODEL VALIDATION


The choice of an appropriate model structure M is most crucial for a successful 
identification application. This choice must be based both on an understanding of 
the identification procedure and on insights and knowledge about the system to be 
identified. In Chapters 4 and 5 we provided lists of typical model structures to be 
used for identification. In this chapter we shall complement these lists by discussing 
how to arrive at a suitable structure, guided by system knowledge and the collected 
data set. 

Once a model structure has been chosen, the identification procedure provides
us with a particular model in this structure. This model may be the best available
one, but the crucial question is whether it is good enough for the intended purpose.
Testing if a given model is appropriate is known as model validation. Such techniques,
which are closely related to the choice of model structure, will also be described in
this chapter.


16.1 GENERAL ASPECTS OF THE CHOICE OF MODEL STRUCTURE 


The route to a particular model structure involves, at least, three steps:


1. To choose the type of model set. (16.1)
This involves, for example, the selection between nonlinear and linear models,
between input-output, black-box and physically parametrized state-space
models, and so on.

2. To choose the size of the model set. (16.2a)
This involves issues like selecting the order of a state-space model, the degrees
of the polynomials in a model like (4.33) or the number of “neurons” in a
neural network. It also contains the problem of which variables to include in
the model description. We thus have to select $\mathcal{M}$ from a given, increasing chain
of structures

$$\mathcal{M}_1 \subset \mathcal{M}_2 \subset \mathcal{M}_3 \cdots \qquad (16.2b)$$

[Recall the definition of $\mathcal{M}_1 \subset \mathcal{M}_2$ in (4.127).] This problem (16.2) will be
called the order selection problem.

3. To choose the model parametrization. (16.3)
When a model set $\mathcal{M}^*$ has been decided on (like a state-space model of a given
order), it remains to parametrize it, that is, to find a suitable model structure
$\mathcal{M}$ whose range equals $\mathcal{M}^*$ (see Section 4.5).


In this section we shall discuss basic guidelines for these three steps. We said in 
Chapter 12 that the goal of the user is to “obtain a good model at a low price.” The 
choice of model structure certainly has a considerable effect on both the quality of 
the resulting model and the price for it. 


Quality of the Model 


The quality of the resulting model can, for example, be measured by a mean-square
error criterion $J(\mathcal{D})$ as in (12.26), where the design variables $\mathcal{D}$ include the model
structure $\mathcal{M}$. (For nonlinear systems and models we could, at least conceptually,
give analogous formalizations.) In Chapter 12 we found it convenient to split up the
mean-square error into a bias contribution and a variance contribution:

$$J(\mathcal{D}) = J_B(\mathcal{D}) + J_P(\mathcal{D}) \qquad (16.4)$$

[see (12.26)]. We would thus select $\mathcal{M}$ so that both bias and variance are kept small.
These are usually, however, conflicting requirements. To reduce bias one basically
has to employ larger and more flexible model structures, requiring more parameters.
Since the variance typically increases with the number of estimated parameters [see
(9.92)], the best model structure is thus a trade-off between:


• Flexibility: Employing model structures that offer good capabilities of describing
different possible systems. Flexibility can be obtained either by using
many parameters or by placing them in “strategic positions.” (16.5)

• Parsimony: Not to use unnecessarily many parameters: to be “parsimonious”
with the model parametrization. (16.6)


This trade-off can be formalized objectively as a minimization of (16.4) with respect
to the model structures.


Price of the Model 


The price of the model is associated with the effort to calculate it, that is, to perform
the minimization in (7.155) or to solve the equation (7.156). This work is highly
dependent on the model structure, which influences:


• The algorithm complexity: We saw in Chapter 10 that solving for $\hat\theta_N$ involves
evaluation of the prediction errors $\varepsilon(t, \theta)$ and their gradients $\psi(t, \theta)$ for a
number of $\theta$. The work associated with these evaluations depends critically on
$\mathcal{M}$. (16.7)


• The properties of the criterion function: The amount of work to solve for $\hat\theta_N$
also depends on how many evaluations of the criterion function and its gradient
are necessary. This is determined by the “shape” of the criterion function
(7.155) or (7.156): nonunique minimum, possible undesired local solutions,
and so on. The “shape” in turn is a result of the choice of $\ell(\cdot)$ and of how the
$\varepsilon(t, \theta)$ depend on $\theta$ (i.e., the model structure). (16.8)


There is also a price associated with the use of the model. A high-order complex
model is more difficult to use for simulation and control design. If it is only marginally
better [in the sense of (16.4)] than a simpler model, it may not be worth the higher
price. Consequently, also

• The intended use of the model (16.9)

will affect the choice of model structure.


General Considerations 


The final choice of model structure will be a compromise between the listed aspects 
(16.5) to (16.9). The techniques and considerations that are used when evaluating 
these aspects can be split into different categories: 


• A priori considerations: Certain aspects are independent of the data set $Z^N$
and can be evaluated a priori, before the data have been measured. We shall
discuss these in Section 16.2.

• Techniques based on preliminary data analysis: With the data available,
certain testing and evaluation of $Z^N$ can be carried out that give insights into
possible and suitable model structures. These techniques do not necessarily
require the computation of a complete model. Section 16.3 contains a discussion
of such preliminary data analysis.

• Comparing different model structures: Before a final model structure is
chosen, it is advisable to shop around in different model structures and compare
quality and prices of the models offered there. This will require the
computation and comparison of several models; such procedures are described in
Section 16.4.

• Validation of a given model: Regardless of how a given model is obtained,
we can always use $Z^N$ to evaluate whether it seems likely that it will serve its
purpose. If a certain model is accepted, we have also implicitly approved the
choice of the underlying model structure. Such model-validation techniques
are described in Sections 16.5 and 16.6.


16.2 A PRIORI CONSIDERATIONS

Type of Model 


The choice of which type of model to use is quite subjective and involves several issues
that are independent of the data set $Z^N$. It is usually the result of a compromise
between the aspects listed previously, combined with more irrational factors like the
availability of computer programs and familiarity with certain models. Let us briefly
comment on the rational issues involved in the choice.




The compromise between parsimony and flexibility is at the heart of the
identification problem. How shall we obtain a good fit to data with few parameters? The
answer usually is to use a priori knowledge about the system, intuition, and
ingenuity. These facts stress that identification can hardly be brought into a fully automated
procedure. The problem of minimizing (16.4) thus favors physically parametrized
models. It will depend on our insight and understanding of the process whether it
is feasible to build a well-founded physically parametrized model structure. This is
of course an application-dependent problem.

For a physical system, a priori information can typically best be incorporated 
into a continuous-time model such as (4.62). This means that the computation of
$\varepsilon(t,\theta)$ and the minimization of (7.155) become a laborious task, regarding both the
programming effort and the computation time required. Aspects of algorithmic
complexity as well as the shape of the criterion function therefore favor black-box 
models. By this we mean a model like (4.33) that adapts its parameters to data, 
without imposing any physical interpretation of their values. 

A general piece of advice is to “try simple things first.” One should go into
sophisticated model structures only if simpler ones do not pass the model-validation tests.
Linear regression models like (4.12), in particular, lead to simple and robust
minimization schemes (the least-squares method; see Section 7.3). They are therefore often
a good first choice for an identification problem.

One should note that using physical a priori knowledge does not necessarily
mean that fancy continuous-time model structures have to be constructed. Some
thinking about the nature of the relationships between the measured signals can
give good hints for model structures. This was illustrated in Example 5.1, where a
“semiphysical” model structure of linear regression type was obtained from fairly
simple a priori considerations. In general, one should contemplate whether
nonlinear transformations of data [such as (5.20) or a logarithmic transformation] will
make it easier for the transformed data to fit a linear model. Note in particular
that nonlinear effects in actuators and sensors may be known and may be used for
helpful redefinitions of the input-output signals. See Kashyap and Rao (1976), Box
and Cox (1964), Daniel and Wood (1980), or Carroll and Ruppert (1988) for further
discussion on data transformations.


Model Order 


Solving problem (16.2) usually requires help from the data. However, physical insight
and the intended model application will often tell which range of model orders should
be considered. Also, even when the data have not been evaluated, knowing $N$ and
the data quality will indicate how many parameters it is feasible to estimate. With
few data points, it is not reasonable to try to determine a model in a complex model
structure.

A related problem is how many different time scales it is feasible to let one and
the same model handle. Problem 13G.1 indicated that, for numerical reasons, it may
be difficult to adequately describe more than two to three decades of the frequency
range within one model. Considerations on sampling rates, proper excitation, and
data record lengths strongly suggest that one should not aim at covering more than
three decades of time constants in one experiment. If the system is stiff, so that
it contains widely separated time constants of interest, the conclusion thus is to
build two (or more) models, each covering a proper part of the frequency range
and each sampled with a corresponding, suitable sampling interval. For a
high-frequency model, the low-frequency dynamics for all practical purposes look like
integrators (the number being equal to the pole excess at low frequencies); think
of a Bode diagram representation. These should thus be postulated for the
high-frequency model. Correspondingly, the high-frequency dynamics look like static
(instantaneous) relationships to the low-frequency model. Thus introduce a no-delay
term $b_0 u(t)$ in this model.


Model Parametrization 


The issue of model parametrization is basically numerical. We seek model
parametrizations that are well conditioned, so that a round-off or other numerical error
in one parameter has a small influence on the input-output behavior of the model.
This is a problem that has been widely recognized in the digital filtering area, but less
so in the identification literature. In fact, the standard input-output model structures
like (4.7) to (4.33) could be quite sensitive to numerical errors. See Problem 16E.1
for an example. (Compare also with Problem 13G.1.) The choice of parametrization
of a linear model essentially amounts to picking a certain state-space representation.
The difference equation models correspond to the observability canonical form of
Example 4.2. Other choices of state variables, such as wave-digital filters or
ladder/lattice filters (cf. Section 10.1), give better conditioned parametrizations. See
Mullis and Roberts (1976) or Oppenheim and Schafer (1975) for a discussion of this
issue. Middleton and Goodwin (1990) have advocated parametrizations in terms of

$$\delta = 1 - q^{-1}$$

rather than $q^{-1}$ to cope with this problem. See also Gevers and Li (1993) for a
thorough discussion of parametrization issues.


16.3 MODEL STRUCTURE SELECTION BASED ON PRELIMINARY DATA 
ANALYSIS 


By preliminary data analysis, we mean calculations that do not involve the
determination of a complete model of the system. Such analysis could prove helpful for
finding suitable model structures.


Estimating the Type of Model 


Generally, data-aided model structure selection appears to be an underdeveloped
field. An exception is the order determination in linear structures, to be discussed
shortly. It is conceivable that various nonparametric techniques could be helpful to
find suitable nonlinear transformations of data, as well as to indicate what type
of dependences between measured variables should be considered. In the statistical
literature, such procedures are discussed (e.g., in Daniel and Wood, 1980, and Parzen,
1985), but they have not really found their way into system identification applications
yet.

A particular problem forms an exception: to test for nonlinear effects. That is,
is it likely that the data can be explained by a linear relationship or will a nonlinear
model structure be required? Such tests are based on the relationships between
higher (than second) order correlations and spectra that follow from linear
descriptions. See Billings and Voon (1983), Rajbman (1981), Haber (1985), Tong (1990),
and Varlaki, Terdik, and Lototsky (1985) for further details.


Order Estimation 


The order of a linear system can be estimated in many different ways. Methods that 
are based on preliminary data analysis fall into the following categories. 


1. Examining the spectral analysis estimate of the transfer function 
2. Testing ranks in sample covariance matrices 

3. Correlating variables 

4. Examining the information matrix 


We shall give a brief account of each of these approaches. 


1. Spectral analysis estimate: A nonparametric estimate of the transfer
function $\hat G_N(e^{i\omega})$ as in (6.82) will give valuable information about resonance peaks and
the high-frequency roll-off and phase shift. All this gives a hint as to what model
orders will be required to give an adequate description of the (interesting part of
the) dynamics. Note, though, that discrete-time Bode plots show some artifacts in
their interpretation in terms of poles and zeros, compared to continuous-time Bode
plots. Thus, use the observations with some care.


2. Testing ranks in covariance matrices: Suppose that the true system is
described by

$$y(t) + a_1 y(t-1) + \cdots + a_n y(t-n) = b_1 u(t-1) + \cdots + b_n u(t-n) + v_0(t) \qquad (16.10)$$

for some noise sequence $\{v_0(t)\}$. Suppose also that $n$ is the smallest number for
which this holds (“$n$ is the true order”). As usual, let

$$\varphi_s(t) = [-y(t-1) \;\ldots\; -y(t-s) \;\; u(t-1) \;\ldots\; u(t-s)]^T \qquad (16.11)$$

Suppose first that $v_0(t) \equiv 0$. Then (16.10) implies that the matrix

$$R^s(N) = \frac{1}{N}\sum_{t=1}^{N} \varphi_s(t)\varphi_s^T(t) \qquad (16.12)$$

will be nonsingular for $s \le n$ (provided $\{u(t)\}$ is persistently exciting) and singular
for $s \ge n+1$. The quantity $\det R^s(N)$ could thus be used as a test quantity for the
model order. This was first suggested by Woodside (1971). The relationship between
singularity of (16.12) and the corresponding model order, however, goes back to Lee
(1964) and the realization algorithm by Ho and Kalman (1966). (See also Lemma
4A.1.)

In case a noise $\{v_0(t)\}$ is present in (16.10), (16.12) can still be used, with a
suitable threshold, provided the signal-to-noise ratio is high. If this is not the case,
Woodside (1971) suggested the use of the “enhanced” matrix

$$\hat R^s(N) = R^s(N) - \hat\sigma^2 R_v \qquad (16.13)$$

where $\hat\sigma^2 R_v$ is the estimated influence of $v_0(t)$ on $R^s(N)$.

A better alternative, when the influence of $v_0(t)$ is not negligible, is to use other
correlation vectors. If $\{v_0(t)\}$ and $\{u(t)\}$ are uncorrelated, we could use

$$\zeta_s(t) = [u(t-1) \;\; u(t-2) \;\ldots\; u(t-2s)]^T \qquad (16.14)$$

and find that

$$\tilde R^s(N) = E\,\varphi_s(t)\zeta_s^T(t) \qquad (16.15)$$

is nonsingular for $s \le n$ and singular for $s \ge n+1$ [cf. the discussion on consistency
of the IV method, (8.98)]. Replacing $E$ by the sample mean then gives a usable test
quantity. If $\{v_0(t)\}$ is known to be a moving average of order $r$, so that $v_0(t-r-1)$
and $v_0(t)$ are uncorrelated, we could also use

$$\zeta_s(t) = \varphi_s(t-r) \qquad (16.16)$$

or any combination of such correlators with (16.14). This order-determination test
has been discussed by Wellstead (1978) and Wellstead and Rojas (1982), and was
apparently first described for multivariable structures in Tse and Weinert (1975).
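
The rank test based on (16.12) can be tried out numerically. The following is a minimal sketch, assuming a hypothetical noise-free first-order system ($n = 1$, $a_1 = -0.5$, $b_1 = 1$) driven by a random binary input; the helper names `simulate_first_order`, `rank_test_matrix`, and `det` are illustrative and not taken from the text. The determinant should be clearly nonzero for $s = 1$ and vanish, up to rounding, for $s \ge 2$.

```python
import random

def simulate_first_order(a, b, u):
    """Noise-free y(t) + a*y(t-1) = b*u(t-1), zero initial condition."""
    y = [0.0]
    for t in range(1, len(u)):
        y.append(-a * y[t - 1] + b * u[t - 1])
    return y

def rank_test_matrix(y, u, s):
    """Sample matrix R^s(N) of (16.12), averaged over the usable samples."""
    dim = 2 * s
    R = [[0.0] * dim for _ in range(dim)]
    count = len(y) - s
    for t in range(s, len(y)):
        phi = [-y[t - k] for k in range(1, s + 1)] + [u[t - k] for k in range(1, s + 1)]
        for i in range(dim):
            for j in range(dim):
                R[i][j] += phi[i] * phi[j] / count
    return R

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, d = len(a), 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(a[r][i]))
        if a[p][i] == 0.0:
            return 0.0
        if p != i:
            a[i], a[p] = a[p], a[i]
            d = -d
        d *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return d

rng = random.Random(0)
u = [rng.choice((-1.0, 1.0)) for _ in range(500)]  # persistently exciting input
y = simulate_first_order(a=-0.5, b=1.0, u=u)       # assumed true order n = 1
dets = {s: det(rank_test_matrix(y, u, s)) for s in (1, 2, 3)}
for s, value in dets.items():
    print(f"s = {s}: det R^s(N) = {value:.3e}")
```

With noisy data the determinant no longer drops to zero, which is why a threshold or one of the alternative correlation vectors (16.14) and (16.16) is needed in practice.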


3. Correlating variables: The order-determination problem (16.2b) is whether
to include one more variable in a model structure or not. This variable could be
$y(t-n-1)$ in (16.10) (a true order-determination problem) or a measured possible
disturbance variable $w(t)$. In any case, the question is whether this new variable has
anything to contribute when explaining the output variable $y(t)$. This is measured by
the correlation between $y(t)$ and $w(t)$. However, to discount the possible
relationship between $y(t)$ and $w(t)$ already accounted for by the smaller model structure,
the correlation should be measured between $w(t)$ and what remains to be explained
[i.e., the residuals $\varepsilon(t,\hat\theta_N) = y(t) - \hat y(t|\hat\theta_N)$]. This is known as canonical correlation
or partial correlation in regression analysis (see Draper and Smith, 1981). See also
the discussion in Section 16.6.

We may also note that the determination of the state-space model order, i.e.,
determining how many of the singular values in (10.127) are significant (or step 2 of
(7.66)), is a test of the same kind.
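
The idea of correlating a candidate variable with the residuals of the smaller model can be sketched as follows. This is only an illustration under assumed conditions: a static relation with hypothetical signals `u` (already in the model), `w` (a candidate that does influence the output), and `z` (a candidate that does not); none of these numbers come from the text.

```python
import math
import random

def correlation(x, z):
    """Sample correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((b - mz) ** 2 for b in z)
    return sxz / math.sqrt(sxx * szz)

rng = random.Random(1)
N = 1000
u = [rng.gauss(0.0, 1.0) for _ in range(N)]  # variable already in the model
w = [rng.gauss(0.0, 1.0) for _ in range(N)]  # candidate that does matter
z = [rng.gauss(0.0, 1.0) for _ in range(N)]  # candidate that does not
e = [rng.gauss(0.0, 0.1) for _ in range(N)]
y = [1.0 * u[t] + 0.8 * w[t] + e[t] for t in range(N)]

# Smaller model structure y(t) = theta * u(t), fitted by least squares.
theta = sum(a * b for a, b in zip(u, y)) / sum(a * a for a in u)
residuals = [y[t] - theta * u[t] for t in range(N)]

corr_w = correlation(residuals, w)
corr_z = correlation(residuals, z)
print(f"residuals vs w: {corr_w:.3f}, residuals vs z: {corr_z:.3f}")
```

A large correlation for `w` suggests that including it would improve the model, while the near-zero value for `z` suggests it has nothing to contribute.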


4. The information matrix: It follows from Theorem 4.1 that, if the model
orders are overestimated in certain model structures, global and local identifiability
will be lost. This means that $\psi(t,\theta)$ will not have full rank at $\theta = \theta^*$ (the limit
value), and hence the information matrix (7.89) will be singular. Since the
Gauss-Newton search algorithm uses the inverse of the information matrix, a natural test
quantity for whether the model order is too high will be the conditioning number of
this matrix. See Young, Jakeman, and McMurtrie (1980), Mehra (1974), Söderström
(1975a), and Stoica and Söderström (1982c).

A related situation occurs when the IV method is used. Then the matrix

$$R(N) = \frac{1}{N}\sum_{t=1}^{N} \zeta(t)\varphi^T(t)$$

in (7.118) will be singular when the orders are overestimated, as we found in (8.98)
and under paragraph 2. Testing the conditioning of this matrix is thus naturally
incorporated in the IV approach.


Multivariable Case: Model Parametrization (*) 


The black-box multivariable parametrization problem is to select the multi-index
$\nu_n$ in (4A.16) to (4A.18). Some different methods for this have been discussed in
the literature. The observability indexes $\{\sigma_i\}$, defined in the proof of Lemma 4.2,
form one possible choice for $\nu_n$. The indexes $\{\sigma_i\}$ are defined by the rank structure
of Hnp+ p- which in turn, in the noise-free case, is related to the rank structure of 
$R^s(N)$ in (16.12). Guidorzi (1975) has suggested the use of $R^s(N)$ to determine the
observability indexes and, in the noise-corrupted case, the analog of the “enhanced”
matrix (16.13). Tse and Weinert (1975) use instead an estimate of the matrix (16.15)
and (16.16) (in the input-free case) for the same purpose.

An alternate route has been considered by van Overbeek and Ljung (1982). 
They use the overlapping model structure (4A.33) and switch, during the criterion 
minimization, from one parametrization to another when the information matrix is 
ill-conditioned. They also link the conditioning of this matrix to the conditioning of 
the state covariance matrix. 

Other non-canonical parameterizations, like a tridiagonal form and a full
parameterization, have been discussed in McKelvey and Helmersson (1996) and
McKelvey (1994). The use of balanced realization parameterizations is described
in Ober (1987) and Hanzon and Ober (1997).


16.4 COMPARING MODEL STRUCTURES 


A most natural approach to search for a suitable model structure is simply to test a
number of different ones and to compare the resulting models. In this section we
shall discuss what to compare and how to evaluate the comparisons. The model to
be evaluated will generically be denoted by $m = \mathcal{M}(\hat\theta_N)$. It is estimated within the
model structure $\mathcal{M}$, which is supposed to have $d_\mathcal{M} = \dim\theta$ free parameters. By the
Estimation Data we mean the data that were used to estimate $m$, while Validation
Data will denote any data set available that has not been used to build any of the
models we would like to evaluate.


What to Compare? 


There are of course a number of ways to evaluate a model. We shall here describe 
evaluations and comparisons that are based on data sets from the system. Generally 
speaking, the tests should bring out the relevant features for the intended model
application, so it is desirable that these data sets have been collected under conditions
that are close to the intended operating conditions. The model tests are then basically
tests of how well the model is capable of reproducing these data.

We shall generally work with $k$-step ahead model predictions $\hat y_k(t|m)$ as the
basis of the comparisons. By that we mean that $\hat y_k(t|m)$ is computed from past data

$$u(t-1), \ldots, u(1), \; y(t-k), \ldots, y(1) \qquad (16.17)$$

using the model $m$. The case when $k$ equals $\infty$ corresponds to the use of past
inputs only, i.e., a pure simulation. We use the notation $\hat y_s(t|m) = \hat y_\infty(t|m)$ for this
case. Similarly we introduce $\hat y_p(t|m) = \hat y_1(t|m)$ for the standard one-step ahead
predictor. For a linear model $y = Gu + He$ we thus have, according to Chapter 3,

$$\hat y_s(t|m) = G(q)u(t) \qquad (16.18a)$$

$$\hat y_p(t|m) = H^{-1}(q)G(q)u(t) + (1 - H^{-1}(q))y(t) \qquad (16.18b)$$

$$\hat y_k(t|m) = W_k(q)G(q)u(t) + (1 - W_k(q))y(t) \qquad (16.18c)$$

with $W_k$ determined as in (3.29). For an output error model, $H(q) = 1$, there
is clearly no difference between the expressions in (16.18). Otherwise, note the
considerable conceptual difference between $\hat y_s$ and $\hat y_p$. The latter has $y(t-1)$ and
earlier $y$-values available and can therefore give fits that “look good,” even though
the model may be bad.


Example 16.1 A Trivial Model

Consider the model

$$m:\quad \hat y(t|t-1) = y(t-1)$$

It will predict the next output to be the previous one. For a data record that is sampled
fast (like the one in Figure 14.1a), $\hat y_p(t|m)$ will be practically indistinguishable from
$y(t)$. On the other hand, $\hat y_s(t|m) \equiv 0$, so the model is useless for simulation. □
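
The contrast in this example is easy to reproduce. The sketch below assumes a hypothetical fast-sampled signal, $y(t) = \sin(0.01t)$, which is not the record of Figure 14.1a, and compares the mean-square error of the trivial one-step predictor with that of its simulated output.

```python
import math

N = 1000
y = [math.sin(0.01 * t) for t in range(N)]  # slowly varying, fast sampled

def mean_square(errors):
    """Mean of the squared errors."""
    return sum(e * e for e in errors) / len(errors)

# One-step-ahead prediction of the trivial model: y_hat_p(t) = y(t-1).
pred_errors = [y[t] - y[t - 1] for t in range(1, N)]
# Pure simulation: no input enters the model, so y_hat_s(t) = 0.
sim_errors = [y[t] - 0.0 for t in range(1, N)]

mse_pred = mean_square(pred_errors)
mse_sim = mean_square(sim_errors)
print(f"one-step MSE = {mse_pred:.2e}, simulation MSE = {mse_sim:.2e}")
```

The one-step error is several orders of magnitude smaller than the simulation error, although the model contains no information about the system at all.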


For the general model (5.66) the simulated output is defined recursively as

$$\hat y_s(t|m) = g(t, Z_s^{t-1}; \hat\theta_N) \qquad (16.19)$$

$$Z_s^{t-1} = \{\hat y_s(t-1|m), u(t-1), \hat y_s(t-2|m), u(t-2), \ldots, \hat y_s(1|m), u(1)\}$$


For control applications, the predicted output over a time span that corresponds to 
the dominating time constant will be an adequate variable to look at. The simulated 
output may be more revealing, since it is a more demanding task to reproduce the 
output from input only. For an unstable model, we clearly have to use predictions.

Now, the models can either be evaluated by visual inspection of plots of $y(t)$
and $\hat y_k(t|m)$, or by the numerical value

$$J_k(m) = \frac{1}{N}\sum_{t=1}^{N} |y(t) - \hat y_k(t|m)|^2 \qquad (16.20)$$

We will also use the notation $J_p = J_1$ and $J_s = J_\infty$. It is useful to give some
normalized measure of this fit. Assume that $y$ has been detrended to zero mean and
define

$$R^2 = 1 - \frac{J_k(m)}{\frac{1}{N}\sum_{t=1}^{N} |y(t)|^2} \qquad (16.21)$$

Then $R$ is that part of the output variation that is explained by the model, and is
often expressed in %. See also (11.38).

The quality measure $J_k(m)$ will depend on the actual data record for which
the comparison is made. It is therefore natural also to consider the expected value
of this measure, where expectation is taken with respect to the data, regarding the
model as a fixed, deterministic quantity:

$$\bar J_k(m) = E J_k(m) \qquad (16.22)$$


This gives a quality measure for the given model. Now, $m = \mathcal{M}(\hat\theta_N)$ is itself a
random variable, being estimated from noisy data. The expectation of the model fit
with respect to $\hat\theta_N$ gives a quality measure for the model structure $\mathcal{M}$:

$$\bar J_k(\mathcal{M}) = E\,\bar J_k(\mathcal{M}(\hat\theta_N)) \qquad (16.23)$$


It should be noted that for linear regression models, the measure $J_p(m)$ can be
computed for many models simultaneously. The requirement is only that the models
are obtained by deleting trailing regressors. This follows from (10.11), which shows
that the norm of the k:th row of $R_2$ gives the increase $J_p(m_1) - J_p(m_2)$ when the
k:th parameter is removed from the model structure.


Comparing Models on Fresh Data Sets: Cross-Validation


It is not so surprising that a model will be able to reproduce the estimation data. The
real test is whether it will be capable of also describing fresh data sets from the process.
A suggestive and attractive way of comparing two different models $m_1$ and $m_2$ is to
evaluate their performance on validation data, e.g., by computing $J_k(m_i)$ in (16.20).
We would then favor that model that shows the better performance. Such procedures
are known as cross-validation and several variants have been developed. See, for
example, Stone (1974) and Snee (1977). An attractive feature of cross-validation
procedures is their pragmatic character: the comparison makes sense without any
probabilistic arguments and without any assumptions about the true system. Their
only disadvantage is that we have to save a fresh data set for the validation, and
therefore cannot use all our information to build the models.

For linear regressions, we can use $J_p(m)$ in the following way: Let $\hat y_p(t|m_t)$
be computed for a model $m_t$ which is estimated from all data except the
observation $(y(t), \varphi(t))$, within the model structure $\mathcal{M}$. Form $J$ by summing over all the
corresponding squared errors. Then $J$ is a measure of the predictive power of this
model structure in a cross-validation sense, yet no data have been “wasted” in the
estimation phase. The procedure is known as PRESS (Prediction sum of squares);
see Allen (1971) and Draper and Smith (1981), Section 6.8. For dynamic systems that
do not have predictors with finite impulse responses, this is less easy to accomplish.
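
The PRESS computation can be sketched for the simplest possible case, a linear regression with one scalar regressor. The data below are hypothetical. The sketch forms the leave-one-out prediction errors directly, by re-estimating the parameter without observation $t$, and also through the standard least-squares shortcut that divides each ordinary residual by one minus its leverage; the two should agree.

```python
import random

rng = random.Random(2)
phi = [0.1 * t for t in range(1, 21)]             # scalar regressor
y = [2.0 * p + rng.gauss(0.0, 0.1) for p in phi]  # hypothetical data

sxx = sum(p * p for p in phi)
sxy = sum(p * v for p, v in zip(phi, y))
theta_all = sxy / sxx                             # estimate from all data
residuals = [v - theta_all * p for p, v in zip(phi, y)]

# Direct PRESS: refit without observation t, then predict y(t).
press_direct = 0.0
for p, v in zip(phi, y):
    theta_t = (sxy - p * v) / (sxx - p * p)
    press_direct += (v - theta_t * p) ** 2

# Shortcut: leave-one-out error = residual / (1 - leverage).
press_shortcut = sum((e / (1.0 - p * p / sxx)) ** 2
                     for e, p in zip(residuals, phi))

rss = sum(e * e for e in residuals)
print(press_direct, press_shortcut, rss)
```

PRESS is always at least as large as the ordinary residual sum of squares, since each left-out point is predicted by a model that has not seen it.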


Comparing Models on Second-hand Data Sets: Evaluating the 
Expected Fit 


The proper quality measure for the model $m$ is the expected criterion $\bar J_k$ in (16.22).
If the model is evaluated on validation data, the observation $J_k$ is a reasonable and
unbiased estimate of $\bar J_k$. This is why model evaluation on validation data is to be
preferred.

If we use estimation data for the comparisons, then $J_k$ is no longer an unbiased
estimate of $\bar J_k$. In this section we shall discuss the nature of this discrepancy in case
the comparison criterion coincides with the estimation criterion. This means that the
value $J_p$ will equal the value of the identification criterion:

$$J_p(m) = \frac{1}{N}\sum_{t=1}^{N} |y(t) - \hat y(t|\hat\theta_N)|^2 = \min_\theta \frac{1}{N}\sum_{t=1}^{N} |y(t) - \hat y(t|\theta)|^2 \qquad (16.24)$$


A Pragmatic Preview. The model obtained in the larger model structure will
automatically yield a smaller value of the criterion of fit, since it is the minimizing value
obtained by minimization over a larger set. As the model structure increases, as in
(16.2b), the minimal value of the criterion will thus behave as depicted in Figure 16.1:
it is a monotonically decreasing function of the model structure flexibility. To begin
with, the value $V_N$ decreases since the model picks up more of the relevant features
of the data. But even after a model structure has been reached that allows a
correct description of the system, the value $V_N$ continues to decrease, now because the
additional (unnecessary) parameters adjust themselves to features of the particular
realization of the noise. This is known as overfit and this extra improved fit is of
course of no value to us, since we are going to apply the model to data with
different noise realizations. It is reasonable that the decrease from overfit should be less
significant than the decrease that results when more relevant features are included
in the model. We will thus be looking for the “knee” in the curve of Figure 16.1.
Indeed, it is good practice to plot this curve to get a subjective opinion on whether
the improved fit is significant and worthwhile.
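
The behavior described here can be observed in a small experiment. The sketch below assumes a hypothetical true system $y(t) = u(t-1) + 0.8u(t-2) + e(t)$ and fits FIR models of increasing order by least squares; the helper names are illustrative. The fit on estimation data can only decrease as the order grows, with a pronounced knee at the true order, while the fit on fresh validation data gains nothing from the extra parameters.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def make_data(rng, N):
    """Assumed true system: y(t) = u(t-1) + 0.8 u(t-2) + e(t)."""
    u = [rng.choice((-1.0, 1.0)) for _ in range(N)]
    y = [0.0, 0.0] + [u[t - 1] + 0.8 * u[t - 2] + rng.gauss(0.0, 0.1)
                      for t in range(2, N)]
    return u, y

def regressors(u, d, t):
    return [u[t - k] for k in range(1, d + 1)]

def fit_fir(u, y, d, start):
    """Least-squares FIR model of order d via the normal equations."""
    A = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for t in range(start, len(y)):
        phi = regressors(u, d, t)
        for i in range(d):
            b[i] += phi[i] * y[t]
            for j in range(d):
                A[i][j] += phi[i] * phi[j]
    return solve(A, b)

def loss(theta, u, y, start):
    """Mean-square one-step prediction error."""
    d = len(theta)
    errs = [y[t] - sum(th * p for th, p in zip(theta, regressors(u, d, t)))
            for t in range(start, len(y))]
    return sum(e * e for e in errs) / len(errs)

rng = random.Random(3)
u_est, y_est = make_data(rng, 200)
u_val, y_val = make_data(rng, 200)
max_order = 6
V_est, V_val = [], []
for d in range(1, max_order + 1):
    theta = fit_fir(u_est, y_est, d, start=max_order)
    V_est.append(loss(theta, u_est, y_est, max_order))
    V_val.append(loss(theta, u_val, y_val, max_order))
    print(f"d = {d}: estimation fit {V_est[-1]:.4f}, validation fit {V_val[-1]:.4f}")
```

All orders are fitted on the same samples (from `max_order` onward), so the structures are properly nested and the estimation fit is guaranteed not to increase.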


Figure 16.1 The minimal value of the loss function, $V_N = \min_\theta V_N(\theta)$, as a function of
the size of the model structures (16.2b).


A Formal Result. For the case that the comparison criterion coincides with the
estimation criterion, we have the following result:


Theorem 16.1. Let

$$\hat\theta_N = \arg\min_\theta V_N(\theta, Z^N) = \arg\min_\theta \frac{1}{N}\sum_{t=1}^{N} \ell(\varepsilon(t,\theta), \theta, t)$$

$$\bar V(\theta) = \bar E\,\ell(\varepsilon(t,\theta), \theta, t) = \lim_{N\to\infty} E V_N(\theta, Z^N)$$

Let $\theta^*$ be the minimizing argument of $\bar V(\theta)$ and suppose

$$E\,N(\hat\theta_N - \theta^*)(\hat\theta_N - \theta^*)^T \to P_\theta \quad \text{as } N \to \infty$$

Then, asymptotically, as $N \to \infty$,

$$E\bar V(\hat\theta_N) \approx E V_N(\hat\theta_N, Z^N) + \frac{1}{N}\operatorname{tr}\bar V''(\theta^*)P_\theta \qquad (16.25)$$

with expectation over the random variable $\hat\theta_N$.
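Proof. Expand $\bar V(\theta)$ around $\theta^*$:

$$\bar V(\hat\theta_N) = \bar V(\theta^*) + \tfrac{1}{2}(\hat\theta_N - \theta^*)^T \bar V''(\xi_N)(\hat\theta_N - \theta^*) \qquad (16.26)$$

Similarly, since $V_N'(\hat\theta_N, Z^N) = 0$,

$$V_N(\hat\theta_N, Z^N) = V_N(\theta^*, Z^N) - \tfrac{1}{2}(\hat\theta_N - \theta^*)^T V_N''(\zeta_N, Z^N)(\hat\theta_N - \theta^*) \qquad (16.27)$$

Take expectation of these two expressions and use the following asymptotic
relationships

$$E\,\tfrac{1}{2}(\hat\theta_N - \theta^*)^T \bar V''(\xi_N)(\hat\theta_N - \theta^*) = E\,\tfrac{1}{2}\operatorname{tr}\{\bar V''(\xi_N)(\hat\theta_N - \theta^*)(\hat\theta_N - \theta^*)^T\} \approx \tfrac{1}{2}\operatorname{tr}\bar V''(\theta^*)P_N \qquad (16.28)$$

where $P_N = (1/N)P_\theta$ is the asymptotic covariance matrix of $\hat\theta_N$. Note that $P_N$
decays as $1/N$. Also,

$$E\,\tfrac{1}{2}(\hat\theta_N - \theta^*)^T V_N''(\zeta_N, Z^N)(\hat\theta_N - \theta^*) \approx \tfrac{1}{2}\operatorname{tr}\bar V''(\theta^*)P_N \qquad (16.29)$$

and

$$E V_N(\theta^*, Z^N) = \bar V(\theta^*)$$

This gives, from (16.26) and (16.27), respectively,

$$E\bar V(\hat\theta_N) \approx \bar V(\theta^*) + \tfrac{1}{2}\operatorname{tr}\bar V''(\theta^*)P_N \qquad (16.30a)$$

$$E V_N(\hat\theta_N, Z^N) \approx \bar V(\theta^*) - \tfrac{1}{2}\operatorname{tr}\bar V''(\theta^*)P_N \qquad (16.30b)$$

which concludes the proof. □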


Note the important difference between $E\bar V(\hat\theta_N)$ and $E V_N(\hat\theta_N, Z^N)$. If we
generate many estimation and validation data sets in a Monte Carlo manner, the second
measure would be the average of the fits of the models as they are fitted to
estimation data, while the first one is the average as the estimated models are evaluated
on validation data. Clearly, it is the first of these values that is the one to consider for
validation purposes.
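
The Monte Carlo interpretation can be illustrated with a small simulation. The sketch below assumes a hypothetical scalar linear regression $y(t) = \theta_0\varphi(t) + e(t)$ with unit-variance white regressor and noise, and the quadratic criterion $\ell = \varepsilon^2$. For this setup $\bar V(\theta) = \lambda_0 + (\theta - \theta_0)^2$ is known in closed form, and (16.25) predicts a gap between the two averages of about $2\lambda_0 d/N$ with $d = 1$.

```python
import random

rng = random.Random(4)
theta0, lam0 = 1.0, 1.0  # assumed true parameter and noise variance
N, runs = 20, 2000

sum_est, sum_val = 0.0, 0.0
for _ in range(runs):
    phi = [rng.gauss(0.0, 1.0) for _ in range(N)]
    y = [theta0 * p + rng.gauss(0.0, lam0 ** 0.5) for p in phi]
    theta_hat = sum(p * v for p, v in zip(phi, y)) / sum(p * p for p in phi)
    # Fit on the estimation data: V_N(theta_hat, Z^N).
    sum_est += sum((v - theta_hat * p) ** 2 for p, v in zip(phi, y)) / N
    # Expected fit on fresh data, in closed form for this setup:
    # V_bar(theta) = lam0 + (theta - theta0)^2 * E[phi^2], with E[phi^2] = 1.
    sum_val += lam0 + (theta_hat - theta0) ** 2

mean_est, mean_val = sum_est / runs, sum_val / runs
gap = mean_val - mean_est
print(f"average fit on estimation data: {mean_est:.3f}")
print(f"average expected fit on fresh data: {mean_val:.3f}")
print(f"gap: {gap:.3f}, asymptotic prediction 2*lam0*d/N: {2 * lam0 / N:.3f}")
```

With such a short record the asymptotic expression is only approximate, but the sign and the order of magnitude of the gap should match.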


Akaike’s Final Prediction-Error Criterion (FPE). Let us now specialize to
$\ell(\varepsilon(t,\theta), \theta, t) = \varepsilon^2(t,\theta)$. Then we have

$$\bar J_p(\mathcal{M}) = E\bar V(\hat\theta_N)$$

Let us also assume that

• The true system is described by $\theta^* = \theta_0$ (i.e., $\mathcal{S} \in \mathcal{M}$)
• The parameters are identifiable so that $\bar V''(\theta_0)$ is invertible (16.31)
• The validation data have the same second order properties as the
estimation data

Then we can use Theorem 16.1 to determine a suitable estimate of $\bar J_p$ that can
be formed from estimation data only: We know from (9.17) that (note that
$\bar V'' = 2E\psi\psi^T$ in our case)

$$P_\theta = 2\lambda_0[\bar V''(\theta_0)]^{-1}, \quad \text{where } \lambda_0 = E e_0^2(t) = \bar V(\theta_0) \qquad (16.32)$$

The last of the assumptions (16.31) means that this function $\bar V(\theta)$ is the same as the
one in (16.25). We can consequently use this expression for $P_\theta$ in (16.25), which gives

$$\bar J_p(\mathcal{M}) = E\bar V(\hat\theta_N) \approx V_N(\hat\theta_N, Z^N) + \frac{2\lambda_0}{N}\operatorname{tr}\{\bar V''(\theta_0)[\bar V''(\theta_0)]^{-1}\} = V_N(\hat\theta_N, Z^N) + \frac{2 d_\mathcal{M}}{N}\lambda_0$$

In the first step, we replaced $E V_N(\hat\theta_N, Z^N)$ with the only observation we have of it,
viz. $V_N(\hat\theta_N, Z^N)$. In the last step we used that

$$\operatorname{tr}\{\bar V''(\theta_0)[\bar V''(\theta_0)]^{-1}\} = \dim\theta = d_\mathcal{M}$$

The expression

$$\bar J_p(\mathcal{M}) = E\bar V(\hat\theta_N) \approx V_N(\hat\theta_N, Z^N) + \frac{2 d_\mathcal{M}}{N}\lambda_0 \qquad (16.33)$$

shows the fundamental cost of parameters. The more parameters are used by the
model structure, the smaller the first term will be. However, each parameter carries a
variance penalty that will contribute with $2\lambda_0/N$ to the expected mean square error
fit. This is true regardless of the importance of the parameter for the fit, i.e., how
much it reduces the first term. Any parameter that improves the fit of $V_N$ by less
than $2\lambda_0/N$ will thus be harmful in this respect.


Now, $\lambda_0$ is not known, but can easily be estimated. According to (16.30b)
(ignoring the first expectation):

$$V_N(\hat\theta_N, Z^N) \approx \bar V(\theta_0) - \tfrac{1}{2}\operatorname{tr}\bar V''(\theta_0)P_N = \lambda_0 - \frac{d_\mathcal{M}}{N}\lambda_0$$

A suitable estimate for $\lambda_0$ is thus obtained as

$$\hat\lambda_N = \frac{V_N(\hat\theta_N, Z^N)}{1 - (d_\mathcal{M}/N)}$$

which, inserted into (16.33), gives

$$\bar J_p(\mathcal{M}) \approx \frac{1 + (d_\mathcal{M}/N)}{1 - (d_\mathcal{M}/N)} V_N(\hat\theta_N, Z^N) = \frac{1 + d_\mathcal{M}/N}{1 - d_\mathcal{M}/N}\,\frac{1}{N}\sum_{t=1}^{N} \varepsilon^2(t, \hat\theta_N) \qquad (16.34)$$

This criterion was first described by Akaike (1969) as a final prediction-error (FPE)
criterion. It shows how to modify the loss function to get a reasonable estimate of
the validation and comparison criterion $\bar J_p$ from estimation data only. The observed
fit must be compensated for using the number of estimated parameters to give a fair
picture of the model quality.
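
As a minimal sketch, (16.34) can be applied to a list of candidate model orders. The loss values below are hypothetical numbers chosen only to illustrate the mechanics; the function name `fpe` is likewise an assumption of this sketch.

```python
def fpe(loss, d, N):
    """Akaike's final prediction error (16.34) for d parameters and N data."""
    if not 0 <= d < N:
        raise ValueError("need 0 <= d < N")
    return loss * (1.0 + d / N) / (1.0 - d / N)

# Hypothetical estimation-data fits V_N for model orders d = 1..5, N = 100.
N = 100
fits = {1: 2.50, 2: 1.10, 3: 1.00, 4: 0.99, 5: 0.985}
scores = {d: fpe(v, d, N) for d, v in fits.items()}
best = min(scores, key=scores.get)
for d, score in scores.items():
    print(f"d = {d}: V_N = {fits[d]:.3f}, FPE = {score:.4f}")
print("order selected by FPE:", best)

# The same value follows from (16.33) with lambda_0 replaced by its estimate.
lam_hat = fits[best] / (1.0 - best / N)
via_16_33 = fits[best] + 2.0 * best * lam_hat / N
```

Although the raw fit keeps improving up to $d = 5$, the improvements beyond $d = 3$ are smaller than the variance penalty, so FPE increases again.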


A Note on Regularization. Suppose that we are using the regularized criterion
(7.107):

$$W_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N} \ell(\varepsilon(t,\theta)) + \delta|\theta - \theta^\#|^2 = V_N(\theta, Z^N) + \delta|\theta - \theta^\#|^2 \qquad (16.35)$$

Assume also that the true system is in the model set, that (16.31) holds and that
$\theta^\# = \theta_0$. The latter assumption is of course not quite realistic, and we shall comment
on it below. The calculations in Chapter 9 can directly be applied to (16.35), to give

$$P_\theta = [\bar W''(\theta_0)]^{-1} Q\, [\bar W''(\theta_0)]^{-1}, \quad \bar W''(\theta_0) = \bar V''(\theta_0) + \delta I, \quad Q = 2\lambda_0 \bar V''(\theta_0)$$

See (9.9)-(9.11). Inserted into the equation leading to (16.33), the trace term
becomes ($H = \bar V''(\theta_0)$)

$$\operatorname{tr} H(H + \delta I)^{-1} H (H + \delta I)^{-1} = \sum_{k=1}^{d_\mathcal{M}} \frac{\sigma_k^2}{(\sigma_k + \delta)^2}$$

where $\sigma_k$ are the eigenvalues (singular values) of $\bar V''(\theta_0)$. This follows since $H$ and
$H + \delta I$ can be simultaneously diagonalized. This means that instead of (16.33), the
regularized criterion leads to


$$\bar J_p(\mathcal{M}) = E\bar V(\hat\theta_N) \approx V_N(\hat\theta_N, Z^N) + \frac{2\lambda_0}{N}\sum_{k=1}^{d_\mathcal{M}} \frac{\sigma_k^2}{(\sigma_k + \delta)^2} \qquad (16.36)$$

We note that with $\delta = 0$ the sum will equal $d_\mathcal{M}$, so the special case (16.33) is then
re-obtained. Typically, the singular values of $\bar V''(\theta_0)$ are widely spread so that either
$\sigma_k \ll \delta$ or $\sigma_k \gg \delta$, which means that the sum really just counts the number of
singular values that are larger than $\delta$. We could think of this as the efficient number
of parameters used by the model structure with the regularized criterion. The other
parameters can be thought of as locked in by the regularization. Comparing with
(16.33) we may regard the regularization parameter $\delta$ as a knob by which we control
the number of free parameters in the model structure, without having to decide which
ones to set free. The criterion will then let $\mathcal{M}$ use those that have the largest influence
on the fit.

The expression (16.36) has been derived under the (unrealistic) assumption
$\theta^\# = \theta_0$. In the common case that $\theta^\# = 0$, the regularization will cause a small
bias in the parameters, due to the pull towards the origin. This is similar to the bias
introduced by using too few parameters, so the bias-variance trade-off in terms of
the regularization parameter $\delta$ is still analogous to the one obtained by explicitly
controlling the number of parameters.
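
The counting interpretation of the sum in (16.36) can be checked directly. The eigenvalues below are hypothetical and deliberately spread over many decades; the function name is an assumption of this sketch.

```python
def effective_parameters(sigmas, delta):
    """Sum in (16.36): sigma_k^2 / (sigma_k + delta)^2 over all eigenvalues."""
    return sum((s / (s + delta)) ** 2 for s in sigmas)

sigmas = [1e3, 1e2, 1e-3, 1e-4]  # hypothetical, widely spread eigenvalues

unregularized = effective_parameters(sigmas, 0.0)  # equals the dimension
regularized = effective_parameters(sigmas, 1.0)    # close to 2
count_above = sum(1 for s in sigmas if s > 1.0)

print(f"delta = 0: {unregularized:.3f} effective parameters")
print(f"delta = 1: {regularized:.3f} effective parameters")
print(f"eigenvalues larger than delta = 1: {count_above}")
```

With $\delta = 1$ the two small eigenvalues contribute almost nothing, so the sum is close to the number of eigenvalues above $\delta$.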


Model Structure Selection Criteria: AIC, BIC, and MDL 


Suppose that the prediction-error criterion is chosen as the normalized log-likelihood
function [see (7.84)]:

$$V_N(\theta, Z^N) = -\frac{1}{N}(\text{log likelihood for the estimation problem}) = -\frac{1}{N}L_N(\theta, Z^N) \qquad (16.37)$$

Then provided that (16.31) holds we know from the asymptotic distribution result of
Section 9.3 that

$$P_\theta = [\bar V''(\theta_0)]^{-1} = \left[-\tfrac{1}{N} E L_N''(\theta_0, Z^N)\right]^{-1} \qquad (16.38)$$

[see (7.80), (7.89), and (9.29)]. This inserted into (16.25) gives the criterion

$$E\bar V(\hat\theta_N) \approx -\frac{1}{N}L_N(\hat\theta_N, Z^N) + \frac{d_\mathcal{M}}{N} \qquad (16.39)$$

since

$$\operatorname{tr}\{\bar V''(\theta_0)[\bar V''(\theta_0)]^{-1}\} = \dim\theta = d_\mathcal{M}$$

This expression is the Akaike AIC criterion (7.106). We can thus phrase the joint
problem of model structure determination and parameter estimation as

$$\{\hat\theta_N, \hat{\mathcal{M}}\} = \arg\min_{\mathcal{M} \in \mathfrak{M}} \; \min_{\theta^\mathcal{M} \in D_\mathcal{M}} \frac{1}{N}\left[-L_N(\theta^\mathcal{M}, Z^N) + d_\mathcal{M}\right] \qquad (16.40)$$

where $L_N$ is the log-likelihood function and superscript $\mathcal{M}$ denotes that $\theta^\mathcal{M}$ is
associated with the model structure $\mathcal{M}$.


Example 16.2 AIC for Gaussian Innovations 
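Suppose that the process innovations are Gaussian with unknown variance $\lambda$. Then

$$L_N(\theta, Z^N) = -\frac{1}{2}\sum_{t=1}^{N}\frac{\varepsilon^2(t,\theta')}{\lambda} - \frac{N}{2}\log\lambda - \frac{N}{2}\log 2\pi$$

where $\theta = [\theta', \lambda]$ [see (7.87) and (7.17)]. For the inner minimization (within a given
model structure) in (16.40), we have (see Problem 7E.7)

$$\hat\theta_N = [\hat\theta'_N, \hat\lambda_N]$$

$$\hat\lambda_N = \frac{1}{N}\sum_{t=1}^{N}\varepsilon^2(t,\hat\theta'_N)$$

$$\hat\theta'_N = \arg\min_{\theta' \in D_{\mathcal M}} \sum_{t=1}^{N}\varepsilon^2(t,\theta')$$

$$\frac{1}{N}L_N(\hat\theta_N, Z^N) = -\frac{1}{2} - \frac{1}{2}\log\hat\lambda_N - \frac{1}{2}\log 2\pi$$

and the outer minimization (w.r.t. $\mathcal M$) in (16.40) takes the form

$$\hat{\mathcal M} = \arg\min_{\mathcal M \in \mathfrak M}\left[\frac{1}{2} + \frac{1}{2}\log 2\pi + \frac{1}{2}\log\left(\frac{1}{N}\sum_{t=1}^{N}\varepsilon^2(t,\hat\theta_N^{\prime\,\mathcal M})\right) + \frac{d_{\mathcal M}}{N}\right]$$

The term to minimize is

$$\log\left[\frac{1}{N}\sum_{t=1}^{N}\varepsilon^2(t,\hat\theta_N^{\mathcal M})\right] + \frac{2 d_{\mathcal M}}{N} \approx \log\left[\left(1 + \frac{2 d_{\mathcal M}}{N}\right)\frac{1}{N}\sum_{t=1}^{N}\varepsilon^2(t,\hat\theta_N^{\mathcal M})\right] \tag{16.41}$$

where the last approximation follows when $\dim\theta^{\mathcal M} \ll N$. Note the similarity with
(16.34). □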
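
To make the approximation in (16.41) concrete, the following minimal sketch evaluates both forms of the term to minimize for a few candidate structures and picks the minimizer. The loss values, the parameter counts, and the helper names `aic_term`, `aic_term_approx`, and `select_structure` are assumptions made for illustration; they do not come from the text.

```python
import math


def aic_term(loss, d, n):
    """log of the mean squared prediction error plus the penalty 2 d / N."""
    return math.log(loss) + 2.0 * d / n


def aic_term_approx(loss, d, n):
    """The form log[(1 + 2 d / N) * loss], intended for d much smaller than N."""
    return math.log((1.0 + 2.0 * d / n) * loss)


def select_structure(candidates, n):
    """candidates: list of (d, loss) pairs; return the d that minimizes aic_term."""
    return min(candidates, key=lambda c: aic_term(c[1], c[0], n))[0]


# Made-up values: number of parameters d and mean squared residual for each structure.
candidates = [(1, 2.10), (2, 1.20), (3, 1.198), (4, 1.197)]
N = 500
for d, loss in candidates:
    print(d, round(aic_term(loss, d, N), 5), round(aic_term_approx(loss, d, N), 5))
print("selected d:", select_structure(candidates, N))
```

The two printed columns differ only by terms of order $(d_{\mathcal M}/N)^2$, which is why either form can be used when $\dim\theta^{\mathcal M} \ll N$.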


The criteria that we have discussed here may, from a more pragmatic point of
view, be seen as joint criteria for the determination of model structure and parameter
values within the structure. Conceptually, they could be written

$$W_N(\theta, \mathcal M, Z^N) = V_N(\theta, Z^N) + U_N(\mathcal M) \tag{16.42}$$

where $V_N$ is the prediction-error criterion (7.155) within certain model structures $\mathcal M$,
and $U_N(\mathcal M)$ is a function that measures some “complexity” of the model structure.
In the cases so far, this measure has been related to the dimensionality of $\theta$:

$$U_N(\mathcal M) = \frac{2\dim\theta}{N} \tag{16.43}$$

These criteria have been directed to find system descriptions that give the smallest
mean-square error. A model that apparently gives a smaller mean-square (prediction)
error fit will be chosen even if it is quite complex. In practice, one may want
to add extra penalty in (16.43) for model complexity, reflecting the cost of using it:
“If I am going to accept a more complex model (according to my own complexity
measure) it has to prove to be significantly better!”

What is meant by a complex model and what penalty should be associated with
it are usually subjective issues. An interesting approach to this problem is taken by
Rissanen (1978). He asserts that the ultimate goal of identification is to achieve the
shortest possible description of data. This leads to a criterion of the type (16.42) with

$$U_N(\mathcal M) = \dim\theta\cdot\frac{\log N}{N} \tag{16.44}$$

called the MDL (minimum description length) criterion. This has also been termed
“BIC” by Akaike.
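
The difference between the penalties (16.43) and (16.44) can be seen in a small numerical sketch. It assumes the Gaussian form of Example 16.2, so that the fit term is the logarithm of the mean squared residual; the candidate structures, their loss values, and the function names are hypothetical.

```python
import math


def penalty_aic(d, n):
    """U_N(M) = 2 dim(theta) / N, as in (16.43)."""
    return 2.0 * d / n


def penalty_mdl(d, n):
    """U_N(M) = dim(theta) * log(N) / N, as in (16.44)."""
    return d * math.log(n) / n


def best_structure(candidates, n, penalty):
    """candidates maps a name to (d, loss); minimize log(loss) + penalty(d, n)."""
    def criterion(name):
        d, loss = candidates[name]
        return math.log(loss) + penalty(d, n)
    return min(candidates, key=criterion)


# Made-up example: the third order model fits only marginally better.
candidates = {"second order": (2, 1.200), "third order": (3, 1.190)}
N = 500
print("AIC picks:", best_structure(candidates, N, penalty_aic))
print("MDL picks:", best_structure(candidates, N, penalty_mdl))
```

Since $\log N > 2$ as soon as $N \geq 8$, the MDL penalty is the heavier one for all but very short data records, so it asks for a larger improvement in fit before a more complex structure is accepted.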


Statistical Hypothesis Tests (*)


The selection between two model structures $\mathcal M_0$ and $\mathcal M_1$, subject to (16.2b), can also
be approached by the theory of statistical tests. The idea is to pose the hypothesis

$$H_0: \text{the data have been generated by } \mathcal M_0(\theta^{(0)}) \tag{16.45}$$

This hypothesis is to be tested against the alternative

$$H_1: \text{the data have been generated by } \mathcal M_1(\theta^{(1)}) \tag{16.46}$$

If now $\mathcal M_1$ is “larger” than $\mathcal M_0$, we should introduce a prejudice against $\mathcal M_1$. This
means that we would prefer the set $\mathcal M_0$ unless there is “convincing evidence” that
the hypothesis $H_1$ is true. In statistics this is expressed as $H_0$ is the “null hypothesis.”

We should now like to make a decision between $H_0$ and $H_1$ such that the risk
(the probability) of rejecting $H_0$ when it is true is less than a certain number $\alpha$. (The
smaller $\alpha$ is chosen, the more prejudice is inflicted against $\mathcal M_1$.) At the same time,
we would like to maximize the probability that $H_0$ is rejected when $H_1$ is true. The
latter quantity is known as the “power” of the test.


Let $\theta_*^{(k)}$ denote the limiting estimate in the model structure $\mathcal M_k$. If this value
gives a correct description of the system ($\mathcal S \in \mathcal M_k$), it can be shown under general
conditions that

$$N\,\frac{V_N(\theta_*^{(k)}, Z^N) - V_N(\hat\theta_N^{(k)}, Z^N)}{V_N(\hat\theta_N^{(k)}, Z^N)} \in As\chi^2(d(k)) \tag{16.47}$$

That is, the random variable on the left side converges in distribution to the
$\chi^2$-distribution with $d(k)$ ($= \dim\theta^{(k)}$) degrees of freedom. This is proved in Lemma II.4
for the case of linear regressions and by Åström and Bohlin (1965) for ARMAX
models. Notice that (16.47) is consistent with the expression (16.30b) in case $\mathcal S \in \mathcal M$.
Under the null hypothesis (16.45), it follows from (16.47) that

$$N\,\frac{V_N(\hat\theta_N^{(0)}, Z^N) - V_N(\hat\theta_N^{(1)}, Z^N)}{V_N(\hat\theta_N^{(1)}, Z^N)} \in As\chi^2(d(1) - d(0)) \tag{16.48}$$

The null hypothesis can thus be tested at any desired confidence level $\alpha$ using this
expression.

In case $V_N$ is chosen as the log-likelihood function, this test also becomes the
likelihood ratio (LR) test (see, e.g., Kendall and Stuart, 1961), which has maximum
power. Bohlin (1978) has derived a maximum power test that does not require the
computation of the model in the large set.

Using (16.48) amounts to rejecting $H_0$ (and hence choosing the structure $\mathcal M_1$)
if

$$V_N(\hat\theta_N^{(0)}, Z^N) - V_N(\hat\theta_N^{(1)}, Z^N) > V_N(\hat\theta_N^{(1)}, Z^N)\cdot\frac{1}{N}\cdot k_d(\alpha) \tag{16.49}$$

where $k_d(\alpha)$ is the $\alpha$ level for the $\chi^2$ distribution with $d(1) - d(0)$ degrees of freedom.
From a user's point of view, (16.49) thus coincides with the use of (16.41) (AIC) or
(16.34) (FPE), for $d \ll N$, with $\alpha$ such that

$$k_d(\alpha) = 2(d(1) - d(0)) \tag{16.50}$$

This is satisfied for $\alpha$ somewhere between 7% and 1% depending on the difference
$d(1) - d(0)$ (within reasonable ranges). There is consequently a clear relationship
between FPE, AIC, and hypothesis testing in practical use. See Söderström (1977) for
further details.
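
The decision rule (16.49) and the level implied by (16.50) can be evaluated with a few lines of code. The sketch below is only an illustration: `chi2_sf` is a hypothetical helper that uses the standard finite-series expressions for the chi-square tail probability with integer degrees of freedom, and the loss values in the usage lines are invented.

```python
import math


def chi2_sf(x, k):
    """P(X > x) for a chi-square variable with integer k >= 1 degrees of freedom."""
    if x <= 0.0:
        return 1.0
    if k % 2 == 0:
        # Even k: exp(-x/2) * sum_{j < k/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, k // 2):
            term *= (x / 2.0) / j
            total += term
        return math.exp(-x / 2.0) * total
    # Odd k: normal tail term plus a finite sum of x^(r - 1/2) / (1 * 3 * ... * (2r - 1)).
    total, term = 0.0, math.sqrt(x)
    for r in range(1, (k - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 / math.pi) * math.exp(-x / 2.0) * total


def structure_test(v0, v1, n, d0, d1, alpha=0.05):
    """Statistic of (16.48); H0 (the smaller structure) is rejected if the p-value < alpha."""
    stat = n * (v0 - v1) / v1
    p_value = chi2_sf(stat, d1 - d0)
    return stat, p_value, p_value < alpha


def aic_level(dd):
    """Level alpha that solves (16.50) for a difference dd in the number of parameters."""
    return chi2_sf(2.0 * dd, dd)


stat, p_value, reject = structure_test(v0=1.20, v1=1.19, n=500, d0=2, d1=3)
print(round(stat, 3), round(p_value, 4), reject)
print([round(aic_level(dd), 4) for dd in (5, 10, 15)])
```

Working with the p-value avoids the need for a table of $k_d(\alpha)$: rejecting when the p-value is below $\alpha$ is the same decision as (16.49).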
For small $N$, a more accurate expression than (16.48) is

$$\frac{V_N(\hat\theta_N^{(0)}, Z^N) - V_N(\hat\theta_N^{(1)}, Z^N)}{V_N(\hat\theta_N^{(1)}, Z^N)}\cdot\frac{N - d(1)}{d(1) - d(0)} \in AsF(N - d(1),\, d(1) - d(0)) \tag{16.51}$$

(cf. Lemma II.4). Based on this asymptotic $F$-distribution, $F$-tests can be applied.


16.5 MODEL VALIDATION 


The parameter estimation procedure picks out the “best” model within the chosen 
model structure. The crucial question then is whether this “best” model is “good 
enough.” This is the problem of model validation. The question has several aspects: 


1. Does the model agree sufficiently well with the observed data? 
2. Is the model good enough for my purpose? 
3. Does the model describe the “true system”? 


Generally (Bohlin, 1991), the method to answer these questions is to confront the
model $\mathcal M(\hat\theta_N)$ with as much information about the true system as is practical. This
includes a priori knowledge, experiment data, and experience of using the model. In
an identification application the most natural entity with which to confront the model
is the data themselves. Model-validation techniques thus tend to focus on question 1.
We shall in this section list a number of tools that are useful for discarding models, as
well as for developing confidence in them. A particularly useful technique, residual
analysis, is treated separately in the following section.


Validation with Respect to the Purpose of the Modeling 


While question 3 is intriguing, it is also, philosophically, impossible to answer. What
matters in practice is question 2. There is always a certain purpose with the modeling.
It might be that the model is required for regulator design, prediction, or simulation.
The ultimate validation then is to test whether the problem that motivated the
modeling exercise can be solved using the obtained model. If a regulator based on the
model gives satisfactory control, then the model was a “valid” one, regardless of the
formal aspects on this concept that can be raised. Often it will be impossible, costly,
or dangerous to test all possible models with respect to their intended use. Instead,
one has to develop confidence in the model in other ways.


Feasibility of Physical Parameters 


For a model structure that is parametrized in terms of physical parameters, a natural 
and important validation is to confront the estimated values and their estimated 
variances with what is reasonable from prior knowledge. It is also good practice to 
evaluate the sensitivity of the input-output behavior with respect to these parameters 
to check their practical identifiability (this should also be reflected by the estimated 
variances). 
Consistency of Model Input-Output Behavior


For black-box models, we focus our interest on their input-output properties. For
linear models, we would normally display these as Bode diagrams. For non-linear
models, they would be inspected by simulation (see the following). It is always good
practice to evaluate and compare different linear models in Bode plots, possibly with
the estimated variance translated to confidence intervals of $G$ (and $H$), see (9.59).
Comparisons between spectral analysis estimates (6.46) and Bode plots derived from
parametric models are especially useful, since they are formed from quite different
underlying assumptions (i.e., model structures; see Problem 7G.2).


Generally, when the true system does not belong to the model set, we obtain
an approximation whose character will depend on the experimental conditions, the
prefilters used, the criterion, and the model structure (Chapter 8). Thus, comparing
Bode plots of models obtained by prediction error methods in different structures,
as well as with different prefilters, by the subspace method and by spectral analysis
will give a good feel for whether the essential features of the dynamics have been
captured.


Model Reduction 


One procedure that tests if the model is a simple and appropriate system description 
is to apply some model-reduction technique to it. If the model order can be reduced
without affecting the input-output properties very much, then the original model was
“unnecessarily complex.” Söderström (1975b) has developed this idea for pole-zero
cancellations.


Parameter Confidence Intervals 


Another procedure that checks whether the current model contains too many
parameters is to compare the estimate with the corresponding estimated standard deviation
(see Section 9.6). If the confidence interval contains zero, we could consider whether
this parameter should be removed. This is usually of interest only when the
corresponding parameter reflects a physical structure, such as model order or time delay.
If the estimated standard deviations are all large, the information matrix is close to
singular. This also is an indication of too large model orders (see Section 16.3).
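
The check described above amounts to a one-line comparison once an estimate and its standard deviation are available. The sketch below assumes an asymptotically Gaussian estimate, so that a symmetric interval based on the normal quantile is appropriate; the function name and the numerical values are hypothetical.

```python
from statistics import NormalDist


def interval_contains_zero(estimate, std_dev, confidence=0.95):
    """Return (low, high, flag) for a symmetric Gaussian confidence interval."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    low, high = estimate - z * std_dev, estimate + z * std_dev
    return low, high, low <= 0.0 <= high


# A trailing coefficient estimated as 0.05 with standard deviation 0.04 (made-up numbers).
print(interval_contains_zero(0.05, 0.04))
# A well-determined coefficient.
print(interval_contains_zero(0.80, 0.10))
```

A parameter flagged in this way is only a candidate for removal; as noted above, the check is mainly of interest when the parameter reflects model order or time delay.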


Simulation and Prediction 


In Section 16.4 we used the models’ ability to reproduce input-output data in terms
of simulations and predictions as a main tool for comparisons. Such plots, and the
numerical fits associated with them, are of course most useful and intuitively appealing
also for evaluating a given model. We see exactly what features the model is capable
of reproducing and what features it has not captured. The discrepancies can be due
to noise or model errors, and we will see the combined effects of these sources. If
we have an independent estimate of the noise level, e.g., from (7.154), we will also
be able to tell from $J_k(m)$ in (16.20) what the size of the model error is.
16.6 RESIDUAL ANALYSIS


The “leftovers” from the modeling process—the part of the data that the model could 
not reproduce—are the residuals 


$$\varepsilon(t) = \varepsilon(t, \hat\theta_N) = y(t) - \hat y(t|\hat\theta_N) \tag{16.52}$$


It is clear that these bear information about the quality of the model. In this section
we shall discuss informal and formal methods to draw conclusions about the model
validity from analysis of the residuals.


Pragmatic Viewpoints 


The bottom line really is that we have a data set $Z^N$, be it estimation or validation
data, and a nominal model $m$. We want to know the quality of the model, which in
a sense is a statement about how it will be able to reproduce new data sets. A simple
and pragmatic starting point is to compute basic statistics for the residuals from the
model:

$$S_1 = \max_t |\varepsilon(t)|, \qquad S_2^2 = \frac{1}{N}\sum_{t=1}^{N}\varepsilon^2(t) \tag{16.53}$$

The intuitive use of these statistics would then be like this: “This model has never
produced a larger residual than $S_1$ (or an average error of $S_2$) for all data we have
seen. It’s likely that such a bound will hold also for future data.” Indeed, the
rationale of the different identification criteria could be said to allow for as strong
such statements as possible. Now, this use of the statistics (16.53) has an implicit
invariance assumption: The residuals do not depend on something that is likely to
change. Of special importance is, of course, that they do not depend on the particular
input used in $Z^N$. If they did, the value of (16.53) would be limited, since the model
should work for a range of possible inputs. To check this, it is reasonable to study
the covariance between residuals and past inputs:


$$\hat R^N_{\varepsilon u}(\tau) = \frac{1}{N}\sum_{t=1}^{N}\varepsilon(t)u(t - \tau) \tag{16.54}$$


If these numbers are small (and we shall shortly quantify what that should mean) 
we have some reason to believe that the measures (16.53) could have relevance also 
when the model is applied to other inputs. 

Another way to express the importance of $\hat R^N_{\varepsilon u}$ being small is as follows: If there
are traces of past inputs in the residuals, then there is a part of $y(t)$ that originates
from the past input and that has not been properly picked up by the model $m$. Hence,
the model could be improved.

Similarly, if we find correlation among the residuals themselves, i.e., if the numbers

$$\hat R^N_{\varepsilon}(\tau) = \frac{1}{N}\sum_{t=1}^{N}\varepsilon(t)\varepsilon(t - \tau) \tag{16.55}$$

are not small for $\tau \neq 0$, then part of $\varepsilon(t)$ could have been predicted from past
data. This means that $y(t)$ could have been better predicted, which again is a sign of
deficiency of the model.
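
The statistics (16.53) and the covariance estimates (16.54) and (16.55) are straightforward to compute. The following sketch uses plain Python lists for the residuals and inputs; terms that would refer to samples outside the record are dropped, which is an assumption of this sketch and not something the text prescribes. The function names are hypothetical.

```python
def residual_stats(eps):
    """Return (S1, S2) of (16.53): the largest absolute residual and the RMS residual."""
    s1 = max(abs(e) for e in eps)
    s2 = (sum(e * e for e in eps) / len(eps)) ** 0.5
    return s1, s2


def cross_cov(eps, u, tau):
    """(1/N) * sum_t eps(t) u(t - tau), as in (16.54); tau may be negative."""
    n = len(eps)
    return sum(eps[t] * u[t - tau] for t in range(n) if 0 <= t - tau < n) / n


def auto_cov(eps, tau):
    """(1/N) * sum_t eps(t) eps(t - tau), as in (16.55)."""
    return cross_cov(eps, eps, tau)


eps = [1.0, -2.0, 1.0, 0.0]
u = [1.0, 1.0, -1.0, 1.0]
print(residual_stats(eps), cross_cov(eps, u, 1), auto_cov(eps, 0))
```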

From a more formal point of view we have motivated the estimation criterion
as a maximum likelihood method, assuming that the data have been generated as in
(7.82):

$$y(t) = g(t, Z^{t-1}; \hat\theta_N) + \varepsilon(t) \tag{16.56}$$

where the $\varepsilon(t)$ have the properties given by (7.81) to be independent of each other
and past data. The model validation question related to data, then, is “Is it likely
that the data record $Z^N$ actually has been generated by (16.56)?” This question is
equivalent to “Is it likely that

$$\varepsilon(t) = y(t) - g(t, Z^{t-1}; \hat\theta_N) \tag{16.57}$$

is a sequence of independent random variables with PDF $f_e(x, t; \hat\theta_N)$?” Clearly
(16.55) and (16.54) form the basis to part of the answer.


Whiteness Test 


The numbers $\hat R^N_\varepsilon(\tau)$ carry information about whether the residuals can be regarded
as white. To get an idea of how large these numbers may be if indeed $\varepsilon(t)$ are white,
we reason as follows: Suppose $\{\varepsilon(t)\}$ is a white noise sequence with zero mean and
variance $\lambda$. Then it follows from Lemma 9A.1 that

$$\frac{1}{\sqrt N}\sum_{t=1}^{N}\begin{bmatrix}\varepsilon(t - 1)\\ \vdots\\ \varepsilon(t - M)\end{bmatrix}\varepsilon(t) \in AsN(0, \lambda^2\cdot I)$$

The k:th row of this vector is $\sqrt N\,\hat R^N_\varepsilon(k)$. Under the assumption that the $\varepsilon$ are white,
this consequently means that

$$\frac{N}{\lambda^2}\sum_{\tau=1}^{M}\left(\hat R^N_\varepsilon(\tau)\right)^2$$

should be asymptotically $\chi^2(M)$-distributed (cf. Appendix II). Replacing the
unknown $\lambda$ by the obvious estimate does not change this, asymptotically. (Rightly, the
distribution becomes an $F$-distribution, see (II.79), but we anyway assume $N$ to be
large.) The test for whiteness will thus be if

$$\zeta_{N,M} = \frac{N}{\left(\hat R^N_\varepsilon(0)\right)^2}\sum_{\tau=1}^{M}\left(\hat R^N_\varepsilon(\tau)\right)^2 \tag{16.58}$$

will pass a test of being $\chi^2(M)$ distributed, i.e., by checking if $\zeta_{N,M} < \chi^2_\alpha(M)$, the
$\alpha$ level of the $\chi^2(M)$-distribution.
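
A minimal sketch of the whiteness test (16.58) follows. The helper names are hypothetical, the test sequences are synthetic, and the threshold 18.307 is the tabulated 95% point of the $\chi^2$-distribution with 10 degrees of freedom, so the sketch is tied to the choice $M = 10$.

```python
import random


def residual_autocov(eps, tau):
    """(1/N) * sum_t eps(t) eps(t - tau), using the samples available in the record."""
    n = len(eps)
    return sum(eps[t] * eps[t - tau] for t in range(tau, n)) / n


def whiteness_statistic(eps, m):
    """zeta_{N,M} of (16.58)."""
    n = len(eps)
    r0 = residual_autocov(eps, 0)
    return n * sum(residual_autocov(eps, tau) ** 2 for tau in range(1, m + 1)) / r0 ** 2


# Tabulated 95% point of the chi-square distribution with 10 degrees of freedom.
CHI2_95_M10 = 18.307

rng = random.Random(0)
white = [rng.gauss(0.0, 1.0) for _ in range(1000)]
# A first order moving average of the white sequence, which is clearly not white.
colored = [white[t] + 0.9 * white[t - 1] for t in range(1, 1000)]
for name, eps in (("white", white), ("colored", colored)):
    zeta = whiteness_statistic(eps, 10)
    print(name, round(zeta, 2), "passes" if zeta < CHI2_95_M10 else "fails")
```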
In addition to this whiteness test, more tests can be performed, like the number
of sign changes of $\varepsilon(t)$, and histogram test for the distribution of $\varepsilon$. See, e.g., Draper
and Smith (1981), Chapter 3, for details of such tests.


Independence between Residuals and Past Inputs 


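To investigate which requirements should be associated with (16.54), we define

$$r = \frac{1}{\sqrt N}\sum_{t=1}^{N}\varphi(t)\varepsilon(t), \qquad \varphi(t) = \begin{bmatrix}u(t - M_1)\\ \vdots\\ u(t - M_2)\end{bmatrix} \tag{16.59}$$

Note that the k:th component of $r$ is equal to $\sqrt N\,\hat R^N_{\varepsilon u}(k + M_1 - 1)$, given by (16.54).
If the $\varepsilon$ are independent of $\varphi$ and can be written as

$$\varepsilon(t) = \sum_{k=0}^{\infty} f_k e(t - k), \quad f_0 = 1, \quad e(t) \text{ white noise with } Ee^2(t) = \lambda \tag{16.60}$$

then it follows from (9.38)-(9.40) that

$$r \in AsN(0, P), \qquad P = \lambda\,\bar E\tilde\varphi(t)\tilde\varphi^T(t), \qquad \tilde\varphi(t) = \sum_{k=0}^{\infty} f_k\varphi(t + k) \tag{16.61}$$

It can be shown that the $(k, \ell)$ element of $P$ can also be expressed as

$$\sum_{\tau=-\infty}^{\infty} R_\varepsilon(\tau)R_u(\tau - (k - \ell)), \qquad R_\varepsilon(\tau) = E\varepsilon(t)\varepsilon(t - \tau), \quad R_u(\tau) = Eu(t)u(t - \tau) \tag{16.62}$$

Now, (16.61) implies that

$$\xi_{N,M} = r^T P^{-1} r \in As\chi^2(M) \tag{16.63}$$

if $\varepsilon$ is independent of the inputs. Thus, $\xi_{N,M}$ is the right quantity to subject to a $\chi^2(M)$
test. Note that we need to estimate a model (16.60) to be able to form this quantity.
If $\varepsilon$ is assumed to be white noise, or has passed the test (16.58), the calculation of
$\xi_{N,M}$ is simplified.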

A simple use of (16.54) is to consider just one given $\tau$ as follows: From the
calculations above, it follows that

$$\sqrt N\,\hat R^N_{\varepsilon u}(\tau) \in AsN(0, P_1), \qquad P_1 = \sum_{k=-\infty}^{\infty} R_\varepsilon(k)R_u(k) \tag{16.64}$$

If $N_\alpha$ denotes the $\alpha$ level of the $N(0, 1)$ distribution, we could thus check if

$$\left|\hat R^N_{\varepsilon u}(\tau)\right| \le \sqrt{\frac{P_1}{N}}\,N_\alpha \tag{16.65}$$

If not, the hypothesis that $\varepsilon(t)$ and $u(t - \tau)$ are independent should be rejected.
An appealing way to carry out the test is to plot $\hat R^N_{\varepsilon u}(\tau)$ as a function of $\tau$. Since $P_1$
in (16.64) does not depend on $\tau$, the confidence limits will be horizontal lines. Such a
plot gives valuable insight in the correctness of the model structure. If, for example,
a time delay of two samples has been assumed in the model, but the true delay is
one sample, then a clear correlation between $u(t - 1)$ and $\varepsilon(t)$ will show up. When
examining plots of $\hat R^N_{\varepsilon u}(\tau)$, note the following points:


1. Correlation between $u(t - \tau)$ and $\varepsilon(t)$ for negative $\tau$ is an indication of output
feedback in the input, not that the model structure is deficient.


2. The least-squares method constructs $\hat\theta_N$ so that $\varepsilon(t, \hat\theta_N)$ is uncorrelated with
the regressors. We thus have $\hat R^N_{\varepsilon u}(\tau) = 0$ for $\tau = 1, \ldots, n_b$, automatically for
the model structure (4.7), when the analysis is carried out on estimation data.


This means that some care should be exercised when the numbers $M_1$ and $M_2$
in $\varphi(t)$ are selected. If estimation data are used, together with an ARX model of
order $n_a$, $n_b$, we should thus have $M_1 > n_b$. Similarly, it is natural to take $M_1 > 0$ if
only causal dependence from past inputs shall be tested. Note that the levels of the
tests are somewhat affected by whether estimation or validation data are used. See
Söderström and Stoica (1990) for an analysis of this.
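
The single-lag check (16.65), with its horizontal confidence limits, can be sketched as follows. Several choices here are assumptions of the sketch and not prescriptions from the text: the infinite sum in (16.64) is truncated at a user-chosen `max_lag` and formed from sample covariances, $N_\alpha$ is taken as the two-sided normal quantile, and the residuals in the usage lines are synthetic, built to contain a delayed copy of the input.

```python
import random
from statistics import NormalDist


def cov(x, y, tau):
    """(1/N) * sum_t x(t) y(t - tau) over the samples available in the record."""
    n = len(x)
    return sum(x[t] * y[t - tau] for t in range(n) if 0 <= t - tau < n) / n


def cross_correlation_check(eps, u, taus, alpha=0.05, max_lag=20):
    """Return the limit of (16.65) and, for each tau, whether |R_eps_u(tau)| stays below it."""
    n = len(eps)
    # P1 of (16.64), truncated at +/- max_lag and guarded against a negative sample value.
    p1 = max(sum(cov(eps, eps, k) * cov(u, u, k) for k in range(-max_lag, max_lag + 1)), 0.0)
    limit = (p1 / n) ** 0.5 * NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return limit, {tau: abs(cov(eps, u, tau)) <= limit for tau in taus}


rng = random.Random(0)
u_demo = [rng.choice([-1.0, 1.0]) for _ in range(400)]
# Residuals that still contain u(t - 1), as if the model had assumed too long a delay.
eps_demo = [0.0] + [0.5 * u_demo[t - 1] + 0.1 * rng.gauss(0.0, 1.0) for t in range(1, 400)]
limit_demo, ok_demo = cross_correlation_check(eps_demo, u_demo, range(-5, 6))
print(round(limit_demo, 3), [tau for tau, ok in ok_demo.items() if not ok])
```

Because the limit does not depend on $\tau$, the same number can be drawn as a pair of horizontal lines in a plot of $\hat R^N_{\varepsilon u}(\tau)$.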

Independence between $u$ and $\varepsilon$ can also be measured in other terms.
Unmodeled nonlinear effects can, for example, be seen in scatter plots of the pairs
$(\varepsilon(t), u(t - \tau))$, or by correlating non-linear transformations of $\varepsilon$ and $u$. See,
for example, Draper and Smith (1981), Chapter 3, Cook and Weisberg (1982), and
Anscombe and Tukey (1963) for general aspects on residual testing, and Billings
and Tao (1991), Lee, White, and Granger (1993), and Luukkonen, Saikkonen, and
Teräsvirta (1988) for tests that specifically aim at detecting nonlinearities.


Tests for Dynamical Systems 


Testing the correlation between past inputs and the residuals is natural to evaluate if
the model has picked up the essential part of the (linear) dynamics from $u$ to $y$. For a
dynamical model, the results of the tests can however be visualized more effectively
if we view them as estimates of the residual dynamics or Model Error Model:

$$\varepsilon(t) = G_\varepsilon(q)u(t) \tag{16.66}$$

In fact, if the input is white, then $\hat R^N_{\varepsilon u}(\tau)$ are approximately the components of the
estimate $\hat\theta_N$ obtained in the FIR-model

$$\varepsilon(t) = \theta^T\varphi(t)$$

with $\varphi$ given by (16.59), i.e., the impulse response of (16.66). For the case of
non-white input, it will be easier to evaluate a plot of the impulse response estimate of
$G_\varepsilon$ than of the correlation estimates $\hat R^N_{\varepsilon u}$, since the latter also are affected by the
internal correlation of $u$.
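
For a white input, the relation between the correlation estimates and the FIR coefficients of the model error model can be checked numerically. In the sketch below the division by $\hat R^N_u(0)$ is an addition of this sketch, included so that inputs with non-unit variance are handled; the function name and the synthetic data are hypothetical.

```python
import random


def fir_from_correlations(eps, u, lags):
    """Approximate FIR coefficients of (16.66) as R_eps_u(k) / R_u(0), for a white input."""
    n = len(eps)
    r_u0 = sum(x * x for x in u) / n
    return [sum(eps[t] * u[t - k] for t in range(k, n)) / n / r_u0 for k in lags]


rng = random.Random(1)
u_white = [rng.choice([-1.0, 1.0]) for _ in range(2000)]
# Residuals generated by a pure one-sample delay with gain 0.5 (made-up error dynamics).
eps_mem = [0.0] + [0.5 * u_white[t - 1] for t in range(1, 2000)]
print([round(g, 3) for g in fir_from_correlations(eps_mem, u_white, [1, 2, 3])])
```

When the input is not white, the regressors are correlated and this shortcut no longer applies; the FIR model then has to be fitted by least squares, which is the situation the last sentence above refers to.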
Even more effective from a control point of view will be to display the frequency
function of the estimate $\hat G_\varepsilon(e^{i\omega})$, along with estimated confidence regions. This gives
a picture of what frequency ranges the model has not captured in the input-output
behavior. Depending on the intended use, the model could then be accepted, even
when (16.65) is violated, provided the errors occur in “harmless” frequency ranges.


Example 16.3 Residual Analysis for Dynamic Systems 
The system 


$$y(t) - 1.2y(t - 1) - 0.15y(t - 2) + 0.35y(t - 3) = u(t - 1) + 0.5u(t - 2) + e(t) - e(t - 1) + 0.4e(t - 2)$$


was simulated over 500 samples with an input consisting of sinusoids between 0.3
and 0.6 rad/sec, and white Gaussian noise $e$ with variance 1. A second order ARX
model $m$ was identified from these data. A validation data set was generated using
a random binary input with a resonance peak around 0.3 rad/sec. Figure 16.2a shows
the result of conventional residual analysis when $m$ was confronted with these data.
Figure 16.2b shows the impulse and frequency responses of the model-error model
(16.66), estimated as a 10th order ARX model. It is clear that the frequency function
plot of the error model gives much more precise information about the model's
qualities, from a control point of view. In Figure 16.3, the amplitude Bode plots
of the model and the true system are compared, and we see that the model error
information from validation data is quite reliable. □
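
The data generation in this example can be imitated with a direct implementation of the difference equation. The sketch assumes that the second output term has lag 2, zero initial conditions, and an input formed as a sum of four sinusoids at 0.3, 0.4, 0.5, and 0.6 rad/sample; the text only says that the input consisted of sinusoids between 0.3 and 0.6 rad/sec, so the exact input is a guess.

```python
import math
import random


def simulate_example_system(u, e):
    """Simulate y(t) - 1.2 y(t-1) - 0.15 y(t-2) + 0.35 y(t-3)
    = u(t-1) + 0.5 u(t-2) + e(t) - e(t-1) + 0.4 e(t-2) from zero initial conditions."""
    n = len(u)
    y = [0.0] * n

    def at(x, t):
        # Samples before the start of the record are taken as zero.
        return x[t] if t >= 0 else 0.0

    for t in range(n):
        y[t] = (1.2 * at(y, t - 1) + 0.15 * at(y, t - 2) - 0.35 * at(y, t - 3)
                + at(u, t - 1) + 0.5 * at(u, t - 2)
                + e[t] - at(e, t - 1) + 0.4 * at(e, t - 2))
    return y


rng = random.Random(0)
n_samples = 500
u_est = [sum(math.sin(w * t) for w in (0.3, 0.4, 0.5, 0.6)) for t in range(n_samples)]
e_est = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
y_est = simulate_example_system(u_est, e_est)
print(len(y_est))
```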


[Figure 16.2, panel (a): Conventional residual analysis, correlation functions. Panel (b): Model-error model estimated from validation data (error model impulse response and error model frequency response).]

Figure 16.2 Model validation of the second order ARX model using validation data.
Dash-dotted lines denote confidence intervals. For the frequency domain plot, the confidence
interval is marked as a shaded region.
Figure 16.3 Amplitude Bode plots of the model m (solid line) and the true
system (dashed line).


Critical Data Evaluation 


The residuals $\varepsilon(t, \hat\theta_N)$ will also tell us, inserted into the influence function (15.12),
which data points have had a large impact on the estimate. The reliability of these
points should be critically evaluated as part of the model-validation procedure. It is
always good practice to plot $\varepsilon(t, \hat\theta_N)$ to inspect the data for outliers and “bad data.”
Compare Example 14.1.


16.7 SUMMARY 


The “true system” is an esoteric entity that cannot be attained in practical modeling.
We have to be content with partial descriptions that are purposeful for our
applications. Sometimes this means that we may have to work with several models of the
same system that are to be used for different operating points, for different time scale
problems, and so on.

In this chapter we have described various methods by which suitable model
structures can be found and by which we can reject or develop confidence in
particular models. Among a priori considerations, we may single out the principle “Try
simple things first.” This usually means that one should start by testing simple linear
regressions, such as ARX models in the linear case, and variants with non-linear data
transformations based on physical insight, whenever appropriate.

For validation of models, we have described an arsenal of methods of different 
characters. It is wise to include several of these in one’s toolbox. We may point to 
the following: 


• Comparing linear models obtained under various conditions in various model
structures (including spectral analysis estimates) in Bode plots


• Comparing measured and simulated outputs for models obtained in different
structures


• Testing residuals for independence of past inputs and possibly for whiteness


• Monitoring parameter estimate confidence intervals for trailing and leading
zeros in transfer-function polynomials, as well as for possible loss of local
identifiability

Finally, the subjective ingredient in model validation should be stressed. The
techniques presented here should be viewed as advisers to the user. It is the user that
makes the decision. In the words of Draper and Smith (1981, p. 273), “The screening
of variables should never be left to the sole discretion of any statistical procedure.”
we have a pure quadratic criterion and apply a Newton method). Increasing the
is consistent with a given validation data set. See, e.g., Skelton (1989), Kosut, Lau,
(1996), and Smith and Dullerud (1996). These approaches typically deal with a worst
case scenario for the noise sequence, i.e., no averaging properties of the noise are
taken into account. This also gives rise to so-called hard bounds on the uncertainty.
See Wahlberg and Ljung (1992), Mäkilä, Partington, and Gustafsson (1995), Mäkilä
(1992), and Tse, Dahleh, and Tsitsiklis (1993). Another focus is to characterize the
model-error model in suitable terms: Goodwin, Gevers, and Ninness (1992), Ninness
and Goodwin (1995), and Ljung and Guo (1997).


16.9 PROBLEMS 


16G.1 Mallows' $C_p$-criterion: Mallows (1973) has suggested the following criterion for
selecting model structures:

$$C_p = \frac{\sum_{t=1}^{N}\varepsilon^2(t, \hat\theta_N)}{\hat\lambda_N} - (N - 2p)$$

Here $p$ is the number of estimated parameters, and $\hat\lambda_N$ is an estimate of the innovations
variance, normally taken as the normalized sum of prediction errors for the largest
model structure considered. $C_p$ is to be minimized w.r.t. $p$. Discuss the relationship
between $C_p$ and AIC.


16E.1 Consider the discrete-time system

$$\mathcal S:\ (1 - 0.95q^{-1})^5 y(t) = u(t - 1)$$

and a model parametrized in terms of the ARX structure

$$\mathcal M:\ y(t) + a_1 y(t - 1) + \cdots + a_5 y(t - 5) = u(t - 1)$$

The true value of $a_5$ is thus $(-0.95)^5 = -0.77378$. Suppose that we obtain a model
where $a_1, \ldots, a_4$ are exactly correct, but where $a_5 = -0.77379$. Where are the poles
of this model? Suggest an alternative model structure for identifying $\mathcal S$ that is less
sensitive to numerical errors.


16E.2 Consider the ARMAX model

$$A(q)y(t) = B(q)u(t) + C(q)e(t)$$

Discuss how to test the hypothesis that $A(q) = C(q)$, that is, we have white
measurement errors.


16E.3 Consider a model structure

$$A(q)y(t) = q^{-k}B(q)u(t) + e(t)$$

with a time delay of $k$ units. Discuss how to determine a good value of $k$ based on
several different techniques described in this chapter.
16E.4 Show that $R$ defined by (16.21) is the correlation coefficient between $\hat y(t)$ and $y(t)$
in case the model has been estimated from the data by minimization of $J_k(m)$ in a
linear regression model structure.


16T.1 Let $\{\varepsilon(t)\}$ be a process independent of $u(t)$ and such that

$$e(t) = H^{-1}(q)\varepsilon(t)$$

is white noise with variance $\lambda$. Let

$$\varphi(t) = \begin{bmatrix}u(t - s - 1)\\ \vdots\\ u(t - s - M)\end{bmatrix}$$
Let 
1 N N 
= Sgn). f = aol Hugen) g(r) 
: t=] i=l 
Show that 
SYSTEM IDENTIFICATION IN
PRACTICE


In this book we have dealt with the theory of system identification. Performing the
identification task in practice enhances the “art side” of the topic. Experience,
intuition, and insights will then play important roles. In this final chapter we shall discuss
the system identification techniques as a toolbox for investigating, understanding,
and mastering real-life systems. We shall first, in Section 17.1, describe the system
identification tool in the hand of the user: interactive computing. Section 17.2
discusses the practical side of identification: how to approach the task. Then, in Section
17.3 we discuss a few applications to real data sets. Finally, in Section 17.4 we try to
answer the ultimate question: What does system identification have to offer for real
problems in engineering and applied science?


17.1 THE TOOL: INTERACTIVE SOFTWARE
The work to produce a model by identification is characterized by the following
sequence: 


1. Specify a model structure. 

2. The computer delivers the best model in this structure. 
3. Evaluate the properties of this model. 

4. Test a new structure, go to step 1. 


See Figure 17.1, which is a more elaborate version of Figure 1.10. The first thing that 
requires help is to compute the model and to evaluate its properties. There are now 
many commercially available program packages for identification that supply such 
help. They typically contain the following routines: 


A Handling of data, plotting, and the like
Filtering of data, removal of drift, choice of data segments, and so on.

B Nonparametric identification methods
Estimation of covariances, Fourier transforms, correlation and spectral analysis,
and so on.
[Figure 17.1 flowchart boxes: Construct the experiment and collect data; Polish and present data (Should data be filtered?); Choice of model structure; Fit the model to the data; Validate the model; Can the model be accepted? (Yes / No: data not OK, model structure not OK).]

Figure 17.1 Identification cycle. Rectangles: the computer's main responsibility.
Ovals: the user's main responsibility.


C Parametric estimation methods
Calculation of parametric estimates in different model structures.

D Presentation of models
Simulation of models, estimation and plotting of poles and zeros, computation
of frequency functions and plotting in Bode diagrams, and so on.

E Model validation
Computation and analysis of residuals ($\varepsilon(t, \hat\theta_N)$); comparison between
different models’ properties, and the like.

The existing program packages differ mainly by various user interfaces and by 
different options regarding the choice of model structure according to item C. 

One of the most used packages is MathWorks' SYSTEM IDENTIFICATION TOOLBOX
(SITB), Ljung (1995), which is used together with MATLAB. The command structure
is given by MATLAB’s programming environment with the work-space concept and
MACRO possibilities in the form of m-files. SITB gives the possibility to use all model
structures of the black-box type described in Section 4.2 with an arbitrary number of
inputs. ARX-models and state-space models with an arbitrary number of inputs and
outputs are also covered. Moreover, the user can define arbitrary tailor-made linear
state-space models in discrete and continuous time as in (4.62), (4.84), and (4.91). A
Graphical User Interface helps the user both to keep track of identified models and
to guide him or her to available techniques.
Other packages of general type include PIM, Landau (1990), the Identification
module in Matrix-X, Van Overschee et al. (1994), ADAPTx, Larimore (1997), and
the Frequency-Domain Identification Toolbox in MATLAB, Kollár (1994). A tool
for specifically dealing with gray box identification, IDKIT, is described in Graebe
(1990).


17.2 THE PRACTICAL SIDE OF SYSTEM IDENTIFICATION 


It follows from our discussion that the most essential element in the process of
identification, once the data have been recorded, is to try out various model
structures, compute the best model in the structures, using the techniques of Chapter
16, and then validate this model. Typically this has to be repeated with quite a few
different structures before a satisfactory model can be found.

The difficulties of this process should not be underestimated, and it will require
substantial experience to master it. Here follows a procedure that could prove useful
to try out. This is adapted from Ljung (1995).

Step I; Looking at the Data Plot the data. Look at them carefully. Try to 
see the dynamics with your own eyes. Can vou see the effects in the outputs of 
the changes in the input? Can nonlinear effects be seen. like different responses 
at different levels, or different responses to a step up and a step down? Are there 
portions of the data that appear to be “messy” or carry no information? Use this 
insight to select portions of the data for estimation and validation purposes. 

Do physical levels play a role in the model? If not, detrend the data by removing
their mean values. The models will then describe how changes in the input give
changes in output, but not explain the actual levels of the signals. This is the normal
situation. The default situation, with good data, is to detrend by removing means,
and then select the first two thirds or so of the data record for estimation purposes,
and use the remaining data for validation. (All of this corresponds to the “Data
Quickstart” in the MATLAB Identification Toolbox.)
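
The default treatment described above (remove the means, keep the first two thirds or so for estimation) can be sketched in a few lines of Python. This is only an illustration of the idea, not the toolbox's implementation; the function names and the toy numbers are assumptions made here.

```python
def remove_mean(signal):
    """Return the signal with its sample mean subtracted."""
    mean = sum(signal) / len(signal)
    return [value - mean for value in signal]


def detrend_and_split(u, y, estimation_fraction=2 / 3):
    """Detrend both records, then split them into estimation and validation parts."""
    if len(u) != len(y):
        raise ValueError("input and output records must have the same length")
    u0, y0 = remove_mean(u), remove_mean(y)
    n_est = round(len(u0) * estimation_fraction)  # "the first two thirds or so"
    return (u0[:n_est], y0[:n_est]), (u0[n_est:], y0[n_est:])


# Toy record (hypothetical numbers, for illustration only).
u = [35.0, 65.0, 65.0, 35.0, 65.0, 35.0]
y = [4.0, 5.0, 6.0, 5.5, 5.0, 4.5]
(u_est, y_est), (u_val, y_val) = detrend_and_split(u, y)
print(len(u_est), len(u_val))
```

Note that the means are removed from the whole record before splitting, so the validation part is detrended with the same levels as the estimation part.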

Step 2: Getting a Feel for the Difficulties Compute and display the spectral
analysis frequency response estimate, the correlation analysis impulse response
estimate, as well as a fourth order ARX model with a delay estimated from the
correlation analysis, and a default order state-space model computed by a subspace method.
(All of this corresponds to the “Estimate Quickstart” in the MATLAB Identification
Toolbox.) Look at the agreement between the

• Spectral Analysis estimate and the ARX and state-space models’ frequency
functions.

• Correlation Analysis estimate and the ARX and state-space models’ transient
responses.

• Measured Validation Data output and the ARX and state-space models’
simulated outputs. We call this the Model Output Plot.


If these agreements are reasonable, the problem is not so difficult, and a relatively
simple linear model will do a good job. Some fine tuning of model orders and noise
models may have to be made, and we can proceed to Step 4. Otherwise go to
Step 3.
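
To make the correlation analysis estimate in Step 2 concrete, here is a minimal Python sketch. It assumes that the input is (close to) zero-mean white noise, in which case the impulse response is simply the input-output cross covariance divided by the input variance. A practical implementation would first prewhiten the input; that step is omitted here, and all names and test values are hypothetical.

```python
import random


def impulse_response_by_correlation(u, y, n_lags):
    """Estimate g(k) = R_yu(k) / R_u(0) for k = 0, ..., n_lags - 1.

    Only valid when u is (close to) zero-mean white noise.
    """
    n = len(u)
    r_u0 = sum(value * value for value in u) / n
    estimate = []
    for k in range(n_lags):
        r_yu = sum(y[t] * u[t - k] for t in range(k, n)) / n
        estimate.append(r_yu / r_u0)
    return estimate


def simulate_fir(u, g):
    """Noise-free output y(t) = sum_k g[k] u(t - k), zero initial conditions."""
    return [sum(g[k] * u[t - k] for k in range(len(g)) if t - k >= 0)
            for t in range(len(u))]


rng = random.Random(0)
u = [rng.choice([-1.0, 1.0]) for _ in range(2000)]
g_true = [0.0, 0.5, 0.3, 0.1]  # hypothetical system with a one-sample delay
g_hat = impulse_response_by_correlation(u, simulate_fir(u, g_true), n_lags=6)
print([round(value, 2) for value in g_hat])
```

The first lag at which the estimate is clearly nonzero gives a rough value for the delay to use in the ARX model.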

Step 3: Examining the Difficulties There may be several reasons why the
comparisons in Step 2 did not look good. This step discusses the most common ones, and
how they can be handled:


Model Unstable: The ARX or state-space model may turn out to be unstable, 
but could still be useful for control purposes. Then change to a 5- or 10-step 
ahead prediction instead of simulation when the agreement between measured 
and model outputs is considered. See (16.20). 


Feedback in Data: If there is feedback from the output to the input, due to
some regulator, then the spectral and correlation analysis estimates, as well as
the state-space model, are not reliable. Discrepancies between these estimates
and the ARX model can therefore be disregarded in this case. In residual
analysis of the parametric models, feedback in data can also be visible as correlation
between residuals and input for negative lags.


Noise Model: If the state-space model is clearly better than the ARX model
at reproducing the measured output, this is an indication that the disturbances
have a substantial influence, and it will be necessary to carefully model them.


Model Order: If a fourth order model does not give a good Model Output
plot, try eighth order. If the fit clearly improves, it follows that higher order
models will be required, but that linear models could be sufficient.


Additional Inputs: If the Model Output fit has not significantly improved by
the tests so far, think over the physics of the application. Are there more signals
that have been, or could be, measured that might influence the output? If so,
include these among the inputs and try again a fourth order ARX model from
all the inputs. (Note that the inputs need not at all be control signals: anything
measurable, including disturbances, should be treated as inputs.)


Nonlinear Effects: If the fit between measured and model output is still bad,
consider again the physics of the application. Are there nonlinear effects in
the system? In that case, form the nonlinearities from the measured data. This
could be as simple as forming the product of voltage and current measurements,
if it is the electrical power that is the driving stimulus in, say, a heating process,
and temperature is the output. This is of course application dependent. It
does not cost very much work, however, to form a number of additional inputs
by reasonable nonlinear transformations of the measured signals, and just test
whether inclusion of them improves the fit.


General Nonlinear Mappings: In some applications physical insight may be 
lacking, so it is difficult to come up with structured non-linearities on physical 
grounds. In such cases, nonlinear, black box models could be a solution. See 
Sections 5.4-5.6. 


Still Problems? If none of these tests leads to a model that is able to reproduce
the validation data reasonably well, the conclusion might be that a sufficiently
good model cannot be produced from the data. There may be many reasons for
this. The most important one is that the data simply do not contain sufficient
information, e.g., due to bad signal to noise ratios, large and non-stationary
disturbances, varying system properties, etc.

Otherwise, use the insights on which inputs to use and which model orders to expect
and proceed to Step 4.
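
As an illustration of the "Nonlinear Effects" item in Step 3, the following Python sketch forms the electrical power from voltage and current measurements, so that the product can be offered to a linear model as an additional input. The function name and the sample values are hypothetical.

```python
def electrical_power(voltage, current):
    """Form a new input signal p(t) = v(t) * i(t) from two measured signals."""
    if len(voltage) != len(current):
        raise ValueError("signals must have the same length")
    return [v * i for v, i in zip(voltage, current)]


# Hypothetical measurements from a heating process.
voltage = [10.0, 12.0, 12.0, 8.0]
current = [1.0, 1.5, 1.5, 0.5]
power = electrical_power(voltage, current)  # candidate input for a linear model
print(power)
```

Whether such a transformed input is worth keeping is then decided as in the text: re-estimate the model with it included and check whether the fit to validation data improves.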

Step 4: Fine Tuning Orders and Noise Structures For real data there is no
such thing as a “correct model structure.” However, different structures can give
quite different model quality. The only way to find this out is to try out a number of
different structures and compare the properties of the obtained models. There are a
few things to look for in these comparisons:


• Fit Between Simulated and Measured Output. Look at the fit between the
model's simulated output and the measured one for the validation data.
Formally, pick the model for which the discrepancy is the lowest. In practice, it is
better to be more pragmatic, and also take into account the model complexity,
and whether the important features of the output response are captured.


• Residual Analysis Test. For a good model, the cross correlation function
between residuals and input does not go significantly outside the confidence
region. See Section 16.6. A clear peak at lag k shows that the effect from input
u(t − k) on y(t) is not properly described. A rule of thumb is that a slowly
varying cross correlation function outside the confidence region is an indication
of too few poles, while sharper peaks indicate too few zeros or wrong delays.

For models that are to be used for control design, it is quite valuable to 
display the result of residual analysis in the frequency domain as in Example 
16.3. 


• Pole Zero Cancellations. If the pole-zero plot (including confidence
intervals) indicates pole-zero cancellations in the dynamics, this suggests that lower
order models can be used. In particular, if it turns out that the order of ARX
models has to be increased to get a good fit, but that pole-zero cancellations are
indicated, then the extra poles are just introduced to describe the noise. Then
try ARMAX, OE, or BJ model structures with an A- or F-polynomial of an
order equal to the number of non-cancelled poles.
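
The residual analysis test in the list above can be sketched as follows. This Python example computes the normalized cross correlation between residuals and past inputs and flags lags that fall outside a confidence region. The bound used here, 2.58/sqrt(N), is a simplification that assumes white residuals and zero-mean signals; the general expression is the one referred to in Section 16.6. Names and test data are hypothetical, and negative lags (relevant for detecting feedback) are left out.

```python
import math
import random


def cross_correlation_test(residuals, u, max_lag, z=2.58):
    """Normalized cross correlation between residuals and past inputs.

    Returns ({lag: (value, outside_region)}, bound). The bound z / sqrt(N)
    is a simplification that assumes white residuals; z = 2.58 gives
    roughly a 99% region.
    """
    n = len(residuals)
    r_e0 = sum(e * e for e in residuals) / n
    r_u0 = sum(v * v for v in u) / n
    scale = math.sqrt(r_e0 * r_u0)
    bound = z / math.sqrt(n)
    result = {}
    for k in range(max_lag + 1):
        r = sum(residuals[t] * u[t - k] for t in range(k, n)) / (n * scale)
        result[k] = (r, abs(r) > bound)
    return result, bound


# Hypothetical residuals that still contain an unmodeled effect of u(t - 3).
rng = random.Random(0)
u = [rng.choice([-1.0, 1.0]) for _ in range(1000)]
residuals = [0.1 * rng.gauss(0.0, 1.0) + (0.5 * u[t - 3] if t >= 3 else 0.0)
             for t in range(1000)]
result, bound = cross_correlation_test(residuals, u, max_lag=10)
print([lag for lag, (_, outside) in result.items() if outside])
```

A sharp peak at a single lag, as constructed here, is the pattern that the rule of thumb associates with a wrong delay or too few zeros.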


What Model Structures Should be Tested? Well, any amount of time can be spent
on checking out a very large number of structures. It often takes just a few seconds to
compute and evaluate a model in a certain structure, so one should have a generous
attitude to the testing. However, experience shows that when the basic properties
of the system’s behavior have been picked up, it is not much use to fine tune orders
ad absurdum just to improve the fit by fractions of a percent. For ARX models and
state-space models estimated by subspace methods there are also efficient algorithms
for handling many model structures in parallel.


Multivariable Systems. Multivariable systems are often more challenging to model.
In particular, systems with several outputs could be difficult. A basic reason for the
difficulties is that the couplings between several inputs and outputs lead to more
complex models: The structures involved are richer and more parameters will be
required to obtain a good fit.

Generally speaking, it is preferable to work with state-space models in the
multivariable case, since the model structure complexity is easier to deal with. It is
essentially just a matter of choosing the model order.

Working with Subsets of the Input-Output Channels. In the process of identifying
good models of a system it is often useful to select subsets of the input and output
channels. Partial models of the system's behavior will then be constructed. It might
not, for example, be clear if all measured inputs have a significant influence on the
outputs. That is most easily tested by removing an input channel from the data,
building a model for how the output(s) depend on the remaining input channels,
and checking if there is a significant deterioration in the model output’s fit to the
measured one. See also the discussion under Step 3 above. Generally speaking,
the fit gets better when more inputs are included and worse when more outputs are
included. To understand the latter fact, it should be realized that a model that has to
explain the behavior of several outputs has a tougher job than one that simply must
account for a single output. If it is difficult to obtain good models for a
multi-output system, it might thus be wise to model one output at a time, to find out which
are the difficult ones to handle. Models that are just to be used for simulations could
very well be built up from single-output models, for one output at a time. However,
models for prediction and control will be able to produce better results if constructed
for all outputs simultaneously. This follows from the fact that knowing the set of all
previous output channels gives a better basis for prediction than just knowing the
past outputs in one channel.

Step 5: Accepting the Model The final step is to accept, at least for the time
being, the model to be used for its intended application. Note the following, though:
No matter how good an estimated model looks on the computer screen, it has only
picked up a simple reflection of reality. Surprisingly often, however, this is sufficient
for rational decision making.


17.3 SOME APPLICATIONS 


The Hairdryer: A Laboratory Scale Application


Consider as a real, but laboratory scale, process Feedback’s Process Trainer PT326,
depicted in Figure 17.2. Its function is like a hairdryer: air is fanned through a tube
and heated at the inlet. The input u is the power of the heating device, which is just
a mesh of resistor wires. The output is the outlet air temperature. It should be said
that the process is well behaved: it has reasonably simple dynamics with quite small
disturbances. It also allows measurements with good signal-to-noise ratio.


Transient Response. The step response of the process is given in Figure 17.3. It
reveals that the dynamics is simple, with no oscillatory poles, the dominating time
constant is around 0.4 seconds, and there is a pure time delay of about 0.14 seconds.


Experiment Design. To collect data for further analysis, a few decisions have to
be taken. Following the discussion of Section 13.7, we select a sampling interval
of 0.08 s, since Figure 17.3 clearly shows that the dominating time constant is not
much less than 0.4 s. A shorter sampling interval would also mean several delays
between the (sampled) input and output sequences. The input was chosen to be a
binary random signal shifting between 35 and 65 W. The probability of shifting the
input at each sample was set to 0.2. A record of 1000 samples was collected, and the
data set is shown in Figure 17.4. As a first step, the sample means of the input and
output sequences were removed (see the discussion of Section 14.1). Then the data
set was split into two halves, the first to be used for estimation, and the second one
for validation.


526 Chap. 17 System Identification in Practice


Figure 17.2 The hairdryer process.

Figure 17.3 The step response from the process.

Figure 17.4 The data set from the process trainer. This is the same set as dryer2
supplied with the System Identification Toolbox.

Figure 17.5 The step responses according to the correlation analysis estimate
(solid), the fourth order ARX model (dashed), and the 3rd order state-space
model (dotted).
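
An input of the kind used in the experiment design above, a binary random signal that shifts level with probability 0.2 at each sample, can be generated as in the following Python sketch. The levels 35 and 65 and the record length 1000 come from the text; the function name, the starting level, and the seed are assumptions made for illustration.

```python
import random


def random_binary_signal(n_samples, low=35.0, high=65.0, p_shift=0.2, seed=0):
    """Binary signal that switches between two levels with probability p_shift."""
    rng = random.Random(seed)
    level = low  # starting level is an arbitrary choice
    signal = []
    for _ in range(n_samples):
        if rng.random() < p_shift:
            level = high if level == low else low
        signal.append(level)
    return signal


u = random_binary_signal(1000)
shifts = sum(1 for previous, current in zip(u, u[1:]) if previous != current)
print(len(u), shifts)
```

A low shift probability keeps the input constant over several samples on average, which concentrates the signal energy at lower frequencies.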


Preliminary Models. Following Step 2 in Section 17.2 we estimate the step/impulse
response by correlation analysis, as described in Section 6.1, compute the
spectral analysis estimate of the frequency function, as well as a fourth order ARX
model and a state-space model (using a subspace method, according to (7.66)). The
order of this model is selected automatically, and turned out to be 3 in this case.
The results of these calculations are shown in Figures 17.5, 17.6, and 17.7. These
plots show that we have good agreement between the models computed in different
ways. This is a clear indication that the models have picked up essential features of
the true process. Moreover, the comparisons in Figure 17.7 show that even these
“immediate” models are able to reproduce the input-output behavior quite well. All
this indicates, according to Step 4 of Section 17.2, that a linear model will do fine,
and some further work to fine-tune orders and delays is all that remains.

Figure 17.6 The Bode plots from spectral analysis (solid), the ARX model
(dashed) and the state-space model (dotted).

[Plot: Measured and simulated model output versus time. Best fits: arxqs 0.096299, n4s3 0.1089.]

Figure 17.7 The measured output (dash-dotted) for the validation data set,
together with the output from the ARX model (arxqs, solid), and the state-space
model (n4s3, dashed) when simulated with the input sequence from the validation
data set. The figures shown are the RMS-values of the difference between
measured and simulated output.


Further Models. To look into suitable orders and delays, we compute, simultaneously,
1000 ARX-models of the type

$$y(t) + a_1 y(t-1) + \dots + a_{n_a} y(t - n_a) = b_1 u(t - n_k) + \dots + b_{n_b} u(t - n_k - n_b + 1)$$

consisting of all combinations of $n_a$, $n_b$, and $n_k$ in the range 1 to 10. The different ARX
models will be referred to as ARX($n_a$, $n_b$, $n_k$). The prediction errors of each model
are then computed for the validation data, and their sum of squares is computed. The
result is shown in Figure 17.8, where the fit for the models is depicted as a function of
the number of parameters used. Only the fit for the best model with a given number of
parameters is shown. The overall best fit is obtained for a model with 15 parameters,
which turns out to be $n_a = 6$, $n_b = 9$, and $n_k = 2$. The figure also shows that
almost as good a fit is obtained for models with far fewer parameters, like 4.
In this case the best orders turn out to be $n_a = 2$, $n_b = 2$, and $n_k = 3$. These
models are added to the model output comparisons in Figure 17.9. We see that the
higher order ARX model is able to reproduce the validation data best, but that the
differences between the models really are minor. We compute also a state-space
model of order 6 as well as an ARMAX-model with $n_a = 3$, $n_b = 3$, $n_c = 2$, and
$n_k = 2$. These models are also shown in the same plot. The residuals of the models
(computed from the validation data) are analyzed in Figure 17.10. It shows that the
ARMAX(3,3,2,2) model and the ARX(9,6,2) model both give residuals that pass
whiteness and independence tests, while the model ARX(2,2,3) shows statistically
significant correlation between past inputs and the residuals.

Figure 17.8 The best fit to validation data for ARX-models as a function of the
number of used parameters.
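
The order scan described above can be imitated on a small scale with the following Python sketch. Each ARX($n_a$, $n_b$, $n_k$) structure is fitted by least squares on estimation data and ranked by the sum of squared one-step prediction errors on validation data. This is an illustrative implementation using the normal equations; it is not the efficient parallel algorithm mentioned in Section 17.2, and the simulated test system is hypothetical.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def regressor(u, y, t, na, nb, nk):
    """phi(t) = [-y(t-1) ... -y(t-na), u(t-nk) ... u(t-nk-nb+1)]."""
    return ([-y[t - i] for i in range(1, na + 1)]
            + [u[t - nk - j] for j in range(nb)])


def fit_arx(u, y, na, nb, nk):
    """Least-squares estimate [a_1 ... a_na, b_1 ... b_nb] via the normal equations."""
    start, d = max(na, nk + nb - 1), na + nb
    r = [[0.0] * d for _ in range(d)]
    f = [0.0] * d
    for t in range(start, len(y)):
        phi = regressor(u, y, t, na, nb, nk)
        for i in range(d):
            f[i] += phi[i] * y[t]
            for j in range(d):
                r[i][j] += phi[i] * phi[j]
    return solve(r, f)


def prediction_error_sum(theta, u, y, na, nb, nk):
    """Sum of squared one-step prediction errors on the data (u, y)."""
    total = 0.0
    for t in range(max(na, nk + nb - 1), len(y)):
        phi = regressor(u, y, t, na, nb, nk)
        total += (y[t] - sum(p * th for p, th in zip(phi, theta))) ** 2
    return total


def scan_orders(est, val, orders):
    """Return {(na, nb, nk): validation loss} for each candidate structure."""
    (ue, ye), (uv, yv) = est, val
    return {o: prediction_error_sum(fit_arx(ue, ye, *o), uv, yv, *o) for o in orders}


# Hypothetical system y(t) - 0.7 y(t-1) = 0.5 u(t-2) + e(t).
rng = random.Random(0)
u = [rng.choice([-1.0, 1.0]) for _ in range(400)]
y = [0.0, 0.0]
for t in range(2, 400):
    y.append(0.7 * y[t - 1] + 0.5 * u[t - 2] + 0.01 * rng.gauss(0.0, 1.0))
orders = [(na, nb, nk) for na in (1, 2) for nb in (1, 2) for nk in (1, 2, 3)]
losses = scan_orders((u[:200], y[:200]), (u[200:], y[200:]), orders)
best = min(losses, key=losses.get)
print(best, fit_arx(u[:200], y[:200], *best))
```

As in the text, the structure with the lowest validation loss is not automatically the one to keep: a smaller structure with almost the same loss may be preferable.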


Figure 17.9 Comparisons between several different models, based on the fit
between measured and simulated output.

Figure 17.10 Results from the residual analysis of three different models: solid
line ARX(9,6,2), dashed line ARX(2,2,3), dotted line ARMAX(3,3,2,2). The
horizontal lines mark the confidence regions. [Panel title: Autocorrelation of residuals for output 1]


Final Choice of Model. Based on this analysis we conclude that there are many
linear models that give a good fit to the system. The ARX(9,6,2) model shows the
best fit to validation data, but is at the same time only marginally better than the
simpler, third order ARMAX(3,3,2,2) model. Both also pass the residual analysis
tests. It seems reasonable to pick this simple model as the final choice. The numerical
value is

$$y(t) - 1.4898\,y(t-1) + 0.7025\,y(t-2) - 0.1123\,y(t-3) = 0.0039\,u(t-2) + 0.0621\,u(t-3) + 0.0284\,u(t-4) + e(t) - 0.5474\,e(t-1) + 0.2236\,e(t-2)$$

The estimated standard deviations of the 8 parameters are 
[0.0574 0.0849 0.0333 0.0015 0.0023 0.0055 ör 0.0523 ] 


We see that the coefficient for u(t − 2) is on the borderline of being significantly
different from zero. This is the reason why models with delay $n_k = 3$ also work well.
However, the small effect from the term u(t − 2) does give an improved fit.


The estimated standard deviation of the noise source e(t) is 0.0388. 
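
A few properties of the chosen model can be read off directly from the reported numbers. The Python sketch below simulates the step response of the deterministic part of the model, computes its static gain, and forms the ratio between the coefficient of u(t − 2) and its reported standard deviation. The coefficients are those given above; the function and variable names are assumptions, and the noise terms in e(t) are ignored.

```python
A = [1.0, -1.4898, 0.7025, -0.1123]      # coefficients of y(t), y(t-1), y(t-2), y(t-3)
B = [0.0, 0.0, 0.0039, 0.0621, 0.0284]   # coefficients of u(t), u(t-1), ..., u(t-4)
SD_B1 = 0.0015                           # reported standard deviation of the u(t-2) coefficient


def static_gain(a, b):
    """Steady-state gain B(1) / A(1) of the deterministic part."""
    return sum(b) / sum(a)


def step_response(a, b, n_samples):
    """Response to a unit step applied at t = 0, with zero initial conditions."""
    y = []
    for t in range(n_samples):
        acc = sum(b[k] for k in range(len(b)) if t - k >= 0)
        acc -= sum(a[i] * y[t - i] for i in range(1, len(a)) if t - i >= 0)
        y.append(acc)
    return y


response = step_response(A, B, 60)
print(round(static_gain(A, B), 4), round(response[-1], 4))
print(round(B[2] / SD_B1, 2))  # coefficient of u(t-2) in units of its standard deviation
```

The ratio for the u(t − 2) coefficient is only a few standard deviations, which is consistent with the remark that this term is on the borderline of significance.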


A Fighter Aircraft 


Consider the aircraft Example 1.2 with data shown in Figure 1.6. Note that these 
data were collected under closed loop operation. 


To develop models of the aircraft's pitch channel from these data, we proceed
as follows. The data set is first detrended, so that the mean of each signal is removed.
Then the data are split into one set consisting of the first 90 samples, to be used for
estimation, and a validation data set consisting of the remaining 90 samples. As a
main tool to screen models we computed the RMS fit between the measured output
and the 10-step ahead predicted output according to the different models. In these
calculations the whole data set was used, in order to let transients die out, but the
fit was computed only for the validation part of the data. The reason for using 10-step
ahead predictions rather than simulations is that the pitch channel of the aircraft is
unstable, and so will most of the estimated models also be. A simulation comparison
may therefore be misleading.

[Plot: Measured and 10 step predicted output]

Figure 17.11 Measured output (dash-dotted line) and 10-step ahead predicted
output (solid line) for aircraft validation data, using an ARX model with
$n_a = 4$, $n_{b_k} = 4$, $n_{k_k} = 1$, $k = 1, 2, 3$.
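
The k-step ahead prediction used for screening can be sketched for an ARX model as follows. Measured outputs are used up to time t − k, and the outputs between t − k and t are replaced by their own predictions, while the inputs are taken as known. The Python code below is an illustrative sketch with hypothetical names; the unstable first-order test system is made up and is not the aircraft model.

```python
def k_step_prediction(a, b, u, y, k):
    """Return {t: yhat(t | t-k)} for the ARX model
    y(t) + a[0] y(t-1) + ... = b[0] u(t-1) + b[1] u(t-2) + ...

    Measured outputs are used up to time t-k; later outputs are replaced
    by their own predictions. Inputs are assumed known up to time t.
    """
    start = max(len(a), len(b)) + k - 1
    predictions = {}
    for t in range(start, len(y)):
        known = list(y[: t - k + 1])           # y(0) ... y(t-k)
        for s in range(t - k + 1, t + 1):      # predict y(t-k+1) ... y(t)
            value = -sum(a[i] * known[s - 1 - i] for i in range(len(a)))
            value += sum(b[j] * u[s - 1 - j] for j in range(len(b)))
            known.append(value)
        predictions[t] = known[t]
    return predictions


# Hypothetical unstable first-order model: y(t) - 1.1 y(t-1) = u(t-1).
a, b = [-1.1], [1.0]
u = [1.0 if t % 7 < 3 else -1.0 for t in range(40)]
y = [0.0]
for t in range(1, 40):
    y.append(1.1 * y[t - 1] + u[t - 1])
predictions = k_step_prediction(a, b, u, y, k=10)
print(max(abs(predictions[t] - y[t]) for t in predictions))
```

Because the predictor is re-anchored to measured data every k samples, errors cannot accumulate indefinitely, which is why this comparison remains meaningful for unstable models where a pure simulation may diverge.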


A typical starting ARX model, using 4 past outputs and 4 past values of each
of the 3 inputs, gave a fit according to Figure 17.11. We see that we get a good fit, so
it seems reasonable that we can do a good modeling job with fairly simple models.
As a next step we calculate 1000 ARX models corresponding to orders in inputs and
outputs and delays ranging between 1 and 10. (In this case all 3 input orders were
kept the same.) The best 1-step ahead prediction fit to the validation data turned out
to be for a model

$$y(t) + a_1 y(t-1) + \dots + a_{n_a} y(t - n_a) = b_1 u_1(t-1) + b_2 u_2(t-1) + b_3 u_3(t-1) + e(t) \tag{17.1}$$

with $n_a = 8$. See Figure 17.14. Note in particular that models that use many
parameters are considerably worse for the validation data. Models of the kind
(17.1) with other values of $n_a$ were also estimated, as well as ARMAX models and
state-space models using the N4SID method. A comparison plot for several
such models is shown in Figure 17.12. The best 10-step ahead prediction fit is obtained
for the ARX model with $n_a = 4$. (Note, though, that the best 1-step ahead prediction
is obtained for $n_a = 8$, as was said above.) The comparison for that model is shown
in Figure 17.13. The result of residual analysis for this model on validation data is
shown in Figure 17.15. We see that this simple model with 7 parameters is capable
of reproducing new measurements quite well; at the same time, it is not falsified by
residual analysis.

Figure 17.12 As Figure 17.11 but for several different models.

Figure 17.13 As Figure 17.11 but for the best ARX model.

Figure 17.14 Comparisons of the 1-step ahead prediction error for 1000
ARX-models for the aircraft data.

Figure 17.15 Residual analysis of the best ARX model for the validation aircraft
data.


Buffer Vessel Dynamics 


This example concerns a typical problem in process industry. It is taken from the
pulp factory in Skutskär, Sweden. Wood chips are cooked in the digester and the
resulting pulp travels through several vessels where it is washed, bleached, etc.


The pulp spends about 48 hours total in the process, and knowing the residence 
time in the different vessels is important in order to associate various portions of 
the pulp with the different chemical actions that have taken place in the vessel at 
different times. Figure 17.16 shows data from one buffer vessel. We denote the 
measurements as follows: 


y(t): The κ-number of the pulp flowing out
u(t): The κ-number of the pulp flowing in
f(t): The output flow
h(t): The level of the vessel


The problem is to determine the residence time in the buffer vessel. (The κ-number
is a quality property that in this context can be seen as a marker allowing us to trace
the pulp.)

To estimate the residence time of the vessel it is natural to estimate the dynamics
from u to y. That should show how long it takes for a change in the input to
have an effect on the output.

We can visually inspect the input-output data and see that the delay seems to
be at least an hour or two. The sampling rate may therefore be too fast and we
resample the data (decimate it) by a factor of 3, thus giving a sampling interval of 12
minutes. We proceed as before, remove the means from the κ-number signals, split
into estimation and validation data and estimate simple ARX-models. This turns out
to give quite bad results.

Figure 17.16 From the pulp factory at Skutskär, Sweden. The plots show the κ-number of
the pulp flowing into a buffer vessel, the κ-number of the pulp coming out from the buffer
vessel, the flow out from the buffer vessel, and the level in the buffer vessel. The sampling
interval is 4 minutes, and the time scale is shown in hours.

According to the recipe of Section 17.2 we should then contemplate if there
are more input signals that may affect the process. Yes, clearly the flow and level of
the vessel should have something to do with the dynamics, so we include these two
inputs. The best model output comparison was achieved for an ARX model with 4
parameters associated with the output and each of the inputs, a delay of 12 from u
and a delay of 1 from f and h. This comparison is shown in Figure 17.17. This does
not look good.


[Plot: Measured and simulated model output versus time. Best fit: arx4412: 1.9452.]

Figure 17.17 The measured validation output y (dash-dotted line) together with
the best linear simulated model output for the system from u, f, h to y.

Some reflection shows that this process indeed must be non-linear (or
time-varying): the flow and the vessel level definitely affect the dynamics. For example, if
the flow was a plug flow (no mixing in the vessel) the vessel would have a dynamics of
a pure delay equal to vessel volume divided by flow. This ratio, which has dimension
time, is really the natural time scale of the process, in the sense that the delay would
be constant in this time scale for a plug flow, even if vessel flow and level vary.


Let us thus resample the data accordingly, i.e., so that a new sample is taken (by
interpolation from the original measurement) equidistantly in terms of integrated
flow divided by volume. In MATLAB terms this will be

z = [y,u]; pf = f./h;            % data matrix and flow-to-level ratio
t = 1:length(z);                 % original sample indices
% sample indices that are equidistant in terms of integrated pf
newt = interp1(cumsum(pf+0.00001), t, [pf(1):sum(pf)]');
newz = interp1(t, z, newt);      % interpolate the data at those indices
y1 = newz(:,1); u1 = newz(:,2);

(The small number added to pf is there to handle those time points where the
flow is zero.) The resampled data are shown in Figure 17.18. We now apply the same
procedure to the resampled data $u_1$ and $y_1$. The best ARX model fit was obtained
for

$$y(t) + a_1 y(t-1) + \dots + a_4 y(t-4) = b_1 u(t-9) + e(t)$$

A slightly better fit was obtained for an output-error model (4.25) with the same orders
($n_b = 1$, $n_f = 4$, $n_k = 9$). The comparison is shown in Figure 17.19. This “looks
good.”
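
For readers who prefer a language other than MATLAB, the resampling idea can be sketched in plain Python as follows. The data are re-indexed so that samples become equidistant in terms of the accumulated ratio of flow to level. This is an adaptation rather than a line-by-line port: the new grid starts at the first accumulated value and uses unit steps, and the interpolation clamps at the ends instead of returning undefined values. All names and the toy data are assumptions; the level h is assumed to be strictly positive.

```python
from bisect import bisect_right
from itertools import accumulate


def interp(x, xp, fp):
    """Piecewise-linear interpolation of the points (xp, fp) at x; xp increasing."""
    if x <= xp[0]:
        return fp[0]
    if x >= xp[-1]:
        return fp[-1]
    hi = bisect_right(xp, x)          # xp[hi - 1] <= x < xp[hi]
    lo = hi - 1
    weight = (x - xp[lo]) / (xp[hi] - xp[lo])
    return fp[lo] + weight * (fp[hi] - fp[lo])


def resample_in_flow_time(u, y, f, h, eps=1e-5):
    """Resample u and y equidistantly in accumulated flow divided by level."""
    pf = [flow / level for flow, level in zip(f, h)]
    # eps keeps the accumulated "flow time" strictly increasing when the flow is zero
    flow_time = list(accumulate(p + eps for p in pf))
    t = list(range(len(pf)))          # original sample indices
    grid = []
    s = flow_time[0]
    while s <= flow_time[-1]:
        grid.append(s)
        s += 1.0
    new_t = [interp(s, flow_time, t) for s in grid]
    u1 = [interp(x, t, u) for x in new_t]
    y1 = [interp(x, t, y) for x in new_t]
    return u1, y1


# Toy data (hypothetical): constant flow-to-level ratio of 0.5 per sample.
n = 20
u = [float(i) for i in range(n)]
y = [float(i) ** 0.5 for i in range(n)]
u1, y1 = resample_in_flow_time(u, y, [1.0] * n, [2.0] * n)
print(len(u1), [round(value, 2) for value in u1[:3]])
```

With a constant ratio of 0.5 the resampled record simply picks roughly every second original sample; with varying flow and level the spacing stretches and shrinks accordingly.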


[Plots: κ-number of Inflow; κ-number of Outflow.]

Figure 17.18 The input and output κ-numbers resampled according to the text.

[Plot: Measured and simulated model output versus time, with a Best Fits legend.]

Figure 17.19 The measured validation output $y_1$ (dash-dotted line) together with
simulated model outputs from resampled $u_1$. An ARX(4,1,9) model is shown as
well as an OE model of the same orders.


The impulse responses of these models are shown in Figure 17.20. We see a 
delay of about 1.75 hours and then a time constant of about 2 hours. The vessel thus 
gives a pure delay as well as some mixing of the contents. The two impulse responses 
are in good agreement, if we take into account their uncertainties. See Andersson
and Pucar (1995) for a more comprehensive treatment of the data in this example.


[Plots: Impulse Response (two panels), horizontal axis Time (h).]

Figure 17.20 The impulse response of the ARX (solid) and OE (dashed) models. The right
figure shows also the corresponding estimated 99% confidence intervals.


17.4 WHAT DOES SYSTEM IDENTIFICATION HAVE TO OFFER? 


System identification techniques form a versatile tool for many problems in science
and engineering. The techniques are, as such, application independent. The value of
the tool has been evidenced by numerous applications in diverse fields. For example,
the proceedings from the IFAC symposium series in System Identification contain
thousands of successful applications from a wide selection of areas. Still, there are
some limitations associated with the techniques, and we shall in this final section give
some comments on this.


Adaptive and Robust Designs: Have They Made Modeling 
Obsolete? 


As we discussed in Section 1.2, models of dynamical systems are instrumental for
many purposes: prediction, control, simulation, filter design, reconstruction of
measurements, and so on. It is sometimes claimed that the need for a model can be
circumvented by more elaborate solutions: adaptive mechanisms where the decision
parameters are directly adjusted or robust designs that are insensitive to the
correctness of the underlying model. One should note, though, that adaptive schemes
typically can be interpreted as the recursive identification algorithms described in
Chapter 11 applied to a specific model structure (e.g., the model parameterized in
terms of the corresponding optimal regulator); see Chapter 7 in Ljung and
Söderström (1983). The model-building feature is thus very much present also in adaptive
mechanisms.


Robust design is based on a nominal model and is determined so that good
operation is secured even if the actual system deviates from the nominal model.
Usually, a neighborhood around the nominal model can be specified within which
performance degradation is acceptable. It is then a very useful fact that models
obtained by system identification can be delivered with a quality tag: estimated
deviations from a true description in the parameter domain or in the frequency
domain. Such models are thus suited for robust design.


Limitations: Data Quality 


It is obvious that the limitation of the use of system identification techniques is linked
to the availability of good data and good model structures. Without a reasonable
data record not much can be done, and there are several reasons why such a record
cannot be obtained in certain applications. A first and quite obvious reason is that the
time scale of the process is so slow that any informative data records by necessity will
be short. Ecological and economic systems may clearly suffer from this problem.
Another reason is that the input may not be open to manipulations, either by its
nature or due to safety and production requirements. The signal-to-noise ratio could
then be bad, and identifiability (informative data sets) perhaps cannot be guaranteed.
Bad signal-to-noise ratios can, in theory, be compensated for by longer data records.
Even if the plant, as such, admits long experimentation time, it may not always be a
feasible way out, due to time variations in the process, drift, slow disturbances, and
so on.


Finally, even when we are allowed to manipulate the inputs, can measure for
long periods, and have good signal-to-noise ratios, it may still be difficult to obtain
a good data record. The prime reason for this is the presence of unmeasurable
disturbances that do not fit well into the standard picture of “stationary stochastic
processes”. We have discussed how to cope with such slow disturbances in Section
14.1 and how to handle occasional “bursts” by robust norms (Section 15.2), and such
measures may often be successful. The fact remains though: data quality must be a
prime concern in system identification applications. This also determines the cost of
the exercise.


Limitations: Model Structures 

It is trivial that a bad model structure cannot offer a good model, regardless of the
amount and quality of the available data. For example, the ARX model structure
in Figure 17.17 can never provide a good description of the buffer vessel dynamics
even if fitted to data collected over several years. The crucial nonlinear mechanisms
must be built in, and this requires physical insight.

The first problem thus is whether the process (around its operation point of
interest) admits a standard, linear, ready-made (“black-box”) model description, or
whether a tailor-made model set must be constructed. In the first case, our chances
of success are good; in the second, we have to resort to some physical insight before
a model can be estimated, or hope that the nonlinear dynamics can be picked up by
a nonlinear black-box structure. This problem clearly is application dependent and
therefore not so much discussed in the identification literature. It cannot, however,
be sufficiently stressed that the key to success lies here: Thinking, intuition, and
insights cannot be made obsolete by automated model construction.


Appendix I 


SOME CONCEPTS FROM 
PROBABILITY THEORY 


In this appendix we list some basic concepts and notions from probability theory
that are used in the book. See a textbook like Papoulis (1965) or Chung (1979) for a
proper treatment.

A random variable e describes the possible numerical outcomes of experiments
whose results cannot be exactly predicted beforehand. The probability of the
numerical values falling in certain ranges is then expressed by the probability density
function (PDF) $f_e(x)$:

$$P(a \le e < b) = \int_a^b f_e(x)\,dx \tag{I.1}$$

If e may assume a certain value with nonzero probability, we can think of $f_e$ as containing
a δ-function component at that value. A formal treatment then replaces (I.1) by
a Stieltjes integral, but this is not essential to our needs.

For a random vector

$$e = \begin{bmatrix} e_1 \\ \vdots \\ e_n \end{bmatrix}$$

a corresponding PDF $f_e(x) = f_e(x_1, \ldots, x_n)$ from $R^n$ to $R$ is defined and

$$P(e \in B) = \int_{x \in B} f_e(x)\,dx \tag{I.2}$$

$B$ here is a subset of $R^n$ and $P(A)$ means “the probability of the event $A$.” $f_e(x)$ is
also known as the joint PDF for $e_1, \ldots, e_n$. The expectation or mean value of $e$ is
defined as

$$Ee = \int_{R^n} x f_e(x)\,dx \tag{I.3}$$

while the covariance matrix is

$$\operatorname{Cov} e = E(e - m)(e - m)^T, \qquad m = Ee \tag{I.4}$$
The random vector $e$ is said to have Gaussian or normal distribution if

$$f_e(x) = \frac{1}{(2\pi)^{n/2} (\det P)^{1/2}} \exp\left[-\tfrac{1}{2}(x - m)^T P^{-1} (x - m)\right] \tag{I.5}$$

The mean is then $m$ and the covariance matrix is $P$. This will be written

$$e \in N(m, P) \tag{I.6}$$
With two random variables $y$ and $z$ we may define the joint PDF as $f(x_y, x_z)$. The
probabilities associated with outcomes of $y$ only, disregarding $z$, are then given by
the PDF for $y$, $f_y(x_y)$:

$$P(a < y < b) = \int_a^b f_y(x_y)\,dx_y$$

Since

$$P(a < y < b) = P(a < y < b \text{ and } -\infty < z < +\infty) = \int_{x_y = a}^{b} \int_{x_z = -\infty}^{\infty} f(x_y, x_z)\,dx_z\,dx_y$$

we find that

$$f_y(x_y) = \int_{x_z = -\infty}^{\infty} f(x_y, x_z)\,dx_z \tag{I.7}$$

We can now introduce the conditional PDF of $z$ given $y$ as

$$f_{z|y}(x_z | x_y) = \frac{f(x_y, x_z)}{f_y(x_y)} \tag{I.8}$$

We then have

$$P(a < z < b \mid y = x_y) = \int_a^b f_{z|y}(x_z | x_y)\,dx_z \tag{I.9}$$

Here $P(A|B)$ is the conditional probability of the event $A$ given the event $B$. Intuitively,
we can think of (I.9) as the probability that $z$ will assume a value between $a$
and $b$ if we already know that the outcome of $y$ was $x_y$. Note that formal definitions
of these concepts require more attention, and the reader should consult a textbook
on probability for that. The expression (I.8) can be seen as a version of Bayes's rule:

$$P(A|B) = \frac{P(A \text{ and } B)}{P(B)} \tag{I.10}$$
Two random variables $y$ and $z$ are independent if

$$f(x_y, x_z) = f_y(x_y) f_z(x_z) \tag{I.11}$$

Then

$$f_{z|y}(x_z | x_y) = f_z(x_z) \tag{I.12}$$

We occasionally deal with complex-valued random variables in this book. If $e$ may
assume complex values, we define the covariance matrix as

$$\operatorname{Cov} e = E(e - m)(\bar{e} - \bar{m})^T \tag{I.13}$$

where the overbar means complex conjugate. Notice that this concept does not give
full information about the covariation between the real and imaginary parts of $e$.
For a complex-valued random vector $e$, the notation

$$e \in N_c(m, P) \tag{I.14}$$

will mean that

1. The real and imaginary parts of $e$ are jointly normal.
2. $Ee = m$ (a complex number).
3. $\operatorname{Cov} e = P$ [defined as in (I.13)].
4. $\operatorname{Re} e$ and $\operatorname{Im} e$ are independent.
5. $\operatorname{Cov} \operatorname{Re} e = \operatorname{Cov} \operatorname{Im} e = \tfrac{1}{2} P$.


Let $y(t)$, $t = 1, 2, \ldots$, be a sequence of random vectors (a discrete time
stochastic process). The outcome or realization of this sequence will then be a
sequence of vectors. Suppose that the event that this sequence converges to a limit
$y^*$ (that may depend on the realization) as $t$ tends to infinity has probability 1. Then
we say that $\{y(t)\}$ converges to $y^*$ (a random vector) with probability 1 (w.p. 1) (or
“almost surely,” a.s., “almost everywhere,” a.e.):

$$y(t) \to y^*, \quad \text{w.p. 1 as } t \to \infty \tag{I.15}$$

Often in our applications $y^*$ will in fact not depend on the realization.
If the associated sequence of PDFs, $f_{y(t)}(x)$, converges (weakly) to a PDF $f^*$,

$$f_{y(t)}(x) \to f^*(x) \tag{I.16}$$

we say that $\{y(t)\}$ converges in distribution to the PDF $f^*$. In the special case when
$f^*$ is the Gaussian distribution (I.5), we say that $\{y(t)\}$ is asymptotically normal with
mean $m$ and covariance $P$, and denote it as

$$y(t) \in \operatorname{AsN}(m, P) \tag{I.17}$$
Useful theorems for proving results like (I.17) are given in Problem 11.3 and in
Lemmas 9A.1 and 9A.2. They are usually known as central limit theorems (CLTs).
To prove convergence with probability 1, the following version of Borel-Cantelli’s
lemma is a good tool:

$$\text{“If } \sum_{k=1}^{\infty} P(|y(k)| > \varepsilon) < \infty \text{ for all } \varepsilon > 0, \text{ then } y(t) \to 0 \text{ w.p. 1 as } t \to \infty\text{”} \tag{I.18}$$

(see Chung, 1974, for a proof).
To estimate probabilities of this kind, Chebyshev's inequality is useful:

$$P(|y| > \varepsilon) \le \frac{1}{\varepsilon^4} E|y|^4 \tag{I.19}$$


Appendix II 


SOME STATISTICAL 
TECHNIQUES FOR LINEAR 
REGRESSIONS 


The purpose of this appendix is twofold: First, to provide a refresher of basic statistical
techniques so as to form a proper background for Part II of this book; second,
methods, algorithms, theoretical analysis, and statistical properties for linear regression
estimates are all archetypal for the more complicated structures we discuss in
Part II. This appendix can therefore also be read as a preview of ideas and analysis,
maximally stripped from technical complications. The appendix has a format so that
it can be read independently of the rest of the book (and vice versa).


II.1 LINEAR REGRESSIONS AND THE LEAST SQUARES ESTIMATE


Linear regressions are among the most common models in statistics, and the least-squares
technique with its root in Gauss’s (1809) work is certainly classical. Treatments
of these techniques are given in many textbooks, and we may mention Rao
(1973) (Chapter 4), Draper and Smith (1981), and Daniel and Wood (1980) as suitable
references for further study.


The Regression Concept 


The statistical theory of regression is concerned with the prediction of a variable $y$
on the basis of information provided by other measured variables $\varphi_1, \ldots, \varphi_d$. The
dependent variable $y$ could, for example, be the yield of a certain crop, while the
independent variables $\varphi_i$ (the regressors) give information about rainfall, sunshine,
soil quality, and the like. There are abundant examples of this situation across all
fields of science and society. The dynamical systems that we consider in Part I clearly
form another application of the regression concept, $y$ being the output of a system
(at a given time) and $\varphi_i$ containing information about past behavior. Let us denote

$$\varphi = \begin{bmatrix} \varphi_1 \\ \varphi_2 \\ \vdots \\ \varphi_d \end{bmatrix}$$
The problem is to find a function of the regressors $g(\varphi)$ such that the difference

$$y - g(\varphi)$$

becomes small (i.e., so that $\hat{y} = g(\varphi)$ is a good prediction of $y$). If $y$ and $\varphi$ are
described within a stochastic framework, one could, for example, aim at minimizing

$$E[y - g(\varphi)]^2 \tag{II.1}$$

It is well known that the function $g$ that minimizes (II.1) is the conditional expectation
of $y$, given $\varphi_1, \ldots, \varphi_d$:

$$g(\varphi) = E[y \mid \varphi] \tag{II.2}$$

This is also known as the regression function or the regression of $y$ on $\varphi$.
Another approach would be to look for the function $g(\varphi)$ that has maximal
correlation with $y$. The answer is essentially the regression function. See Problem II.2.


Linear Regressions 


With unknown properties of the variables $y$ and $\varphi$, it is not possible to determine
the regression function $g(\varphi)$ a priori. It has to be estimated from data and must
therefore be suitably parametrized. The special case where this parametrization is
constrained to be linear has been studied extensively. We are then trying to fit $y$ to
a linear combination of the $\varphi_i$:

$$g(\varphi) = \theta_1 \varphi_1 + \theta_2 \varphi_2 + \cdots + \theta_d \varphi_d \tag{II.3}$$

With the vector

$$\theta = \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_d \end{bmatrix}$$

(II.3) can be written

$$g(\varphi) = \varphi^T \theta \tag{II.4}$$

Remark: Of course, “affine” functions

$$g(\varphi) = \theta_0 + \varphi^T \theta \tag{II.5}$$

could also be considered. By extending the regressors by the constant $\varphi_0 \equiv 1$
and the parameter vector $\theta$ accordingly, the case (II.5) is however subsumed
in (II.4).
Least-squares Estimate


Typically, we are not supplied with exact a priori information about the relationship
between $y(t)$ and $\varphi(t)$. What we have instead are “historic data,” a collection of
previous observations of related values of $y$ and $\varphi$. It is convenient to enumerate
these values using an argument $t$:

$$y(t), \ \varphi(t), \qquad t = 1, \ldots, N \tag{II.6}$$

With the historic data we could replace the variance (II.1) by the sample variance

$$\frac{1}{N} \sum_{t=1}^{N} [y(t) - g(\varphi(t))]^2$$

In the linear case (II.4), we thus have

$$V_N(\theta) = \frac{1}{N} \sum_{t=1}^{N} [y(t) - \varphi^T(t)\theta]^2 \tag{II.7}$$

instead of (II.1), and a suitable $\theta$ to choose is the minimizing argument of (II.7):

$$\hat{\theta}_N = \arg\min_{\theta} V_N(\theta) \tag{II.8}$$

This is the least-squares estimate (LSE). Based on the previous observations, we
would thus use

$$\hat{y}(t) = \varphi^T(t)\hat{\theta}_N$$

as a predictor function.

Notice that this method of selecting $\theta$ makes sense whether or not we have
imposed a stochastic framework for the problem. The parameter $\hat{\theta}_N$ is simply the
value that gives the best performing predictor when applied to historic data. This
“pragmatic” interpretation of the LSE was given also by its inventor, K. F. Gauss:


In conclusion, the principle that the sum of the squares of the differences
between the observed and the computed quantities must be minimum may,
in the following manner, be considered independently of the calculus of
probabilities (Gauss, 1809).


The unique feature of (II.7) is that it is a quadratic function of $\theta$. Therefore, it can
be minimized analytically (see Problem 7D.2). We find that all $\hat{\theta}_N$ that satisfy

$$\left[ \frac{1}{N} \sum_{t=1}^{N} \varphi(t)\varphi^T(t) \right] \hat{\theta}_N = \frac{1}{N} \sum_{t=1}^{N} \varphi(t) y(t) \tag{II.9}$$

yield the global minimum of $V_N(\theta)$. This set of linear equations is known as the
normal equations. If the matrix on the left is invertible, we have the LSE

$$\hat{\theta}_N = \left[ \frac{1}{N} \sum_{t=1}^{N} \varphi(t)\varphi^T(t) \right]^{-1} \frac{1}{N} \sum_{t=1}^{N} \varphi(t) y(t) \tag{II.10}$$
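As a minimal sketch of how the normal equations can be formed and solved numerically, the following Python fragment builds the two averaged sums and solves the resulting linear system. The helper names `solve` and `least_squares`, and the straight-line test data, are illustrative assumptions and not part of the text.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            raise ValueError("matrix is singular; normal equations not uniquely solvable")
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def least_squares(phis, ys):
    """Return the parameter vector solving the normal equations."""
    N, d = len(ys), len(phis[0])
    # Left-hand matrix: (1/N) * sum over t of phi(t) phi(t)^T
    R = [[sum(p[i] * p[j] for p in phis) / N for j in range(d)] for i in range(d)]
    # Right-hand vector: (1/N) * sum over t of phi(t) y(t)
    f = [sum(p[i] * y for p, y in zip(phis, ys)) / N for i in range(d)]
    return solve(R, f)


# Hypothetical noise-free data from y = 2 + 3x, with regressors phi = [1, x]
xs = [0.0, 1.0, 2.0, 3.0]
phis = [[1.0, x] for x in xs]
ys = [2.0 + 3.0 * x for x in xs]
theta = least_squares(phis, ys)
print(theta)
```

With noise-free data the estimate reproduces the generating coefficients up to rounding. For badly conditioned regressors, a QR-based solver is usually preferred to forming the normal equations explicitly.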
Matrix Formulation


For some calculations, the expressions (II.6) to (II.10) can be written more conveniently
in matrix form. Define the $N \times 1$ column vector

$$Y_N = \begin{bmatrix} y(1) \\ \vdots \\ y(N) \end{bmatrix} \tag{II.11}$$

and the $N \times d$ matrix

$$\Phi_N = \begin{bmatrix} \varphi^T(1) \\ \vdots \\ \varphi^T(N) \end{bmatrix} \tag{II.12}$$

Then the criterion (II.7) can be written

$$V_N(\theta) = \frac{1}{N} |Y_N - \Phi_N \theta|^2 = \frac{1}{N} (Y_N - \Phi_N \theta)^T (Y_N - \Phi_N \theta) \tag{II.13}$$

The normal equations take the form

$$[\Phi_N^T \Phi_N] \hat{\theta}_N = \Phi_N^T Y_N \tag{II.14}$$

and the estimate

$$\hat{\theta}_N = [\Phi_N^T \Phi_N]^{-1} \Phi_N^T Y_N \tag{II.15}$$

We may in (II.15) recognize the (Moore-Penrose) pseudoinverse of $\Phi_N$:

$$\Phi_N^{\dagger} = [\Phi_N^T \Phi_N]^{-1} \Phi_N^T \tag{II.16}$$

Equation (II.15) thus gives the pseudoinverse solution to the overdetermined
($N > d$) system of linear equations

$$Y_N = \Phi_N \theta \tag{II.17}$$


Geometric Interpretation 


The least-squares solution can be given a geometric interpretation that may be helpful
when determining certain properties. Let the columns of $\Phi_N$ be denoted by

$$\Phi_N = [\Phi^{(1)} \ \cdots \ \Phi^{(d)}]$$

and consider $Y_N$ and $\Phi^{(1)}, \ldots, \Phi^{(d)}$ as vectors in the vector space $R^N$. The problem
expressed in (II.17) is to find a linear combination of the vectors $\Phi^{(i)}$, $i = 1, \ldots, d$,
that approximates $Y_N$ as well as possible. Let $D_d$ be the $d$-dimensional subspace
that is spanned by the $\Phi^{(i)}$. If $Y_N$ happens to belong to this subspace, we can describe
it as a unique linear combination of the $\Phi^{(i)}$. Otherwise, the best approximation of $Y_N$ in
the subspace $D_d$ is the vector in $D_d$ that has the smallest distance to $Y_N$, which is
well known to be the orthogonal projection of $Y_N$ on $D_d$. See Figure II.1.

Figure II.1 The least-squares solution as orthogonal projection ($d = 2$ and $N = 3$).

Let this projection be denoted by $\hat{Y}_N$. Since it is the orthogonal projection, we
have

$$(Y_N - \hat{Y}_N) \perp \Phi^{(i)}$$

That is,

$$(Y_N - \hat{Y}_N)^T \Phi^{(i)} = 0, \qquad i = 1, \ldots, d$$

and since $\hat{Y}_N \in D_d$, we have for some coordinates $\hat{\theta}_j$

$$\hat{Y}_N = \sum_{j=1}^{d} \hat{\theta}_j \Phi^{(j)}$$

This gives

$$Y_N^T \Phi^{(i)} = \sum_{j=1}^{d} \hat{\theta}_j \left(\Phi^{(j)}\right)^T \Phi^{(i)}, \qquad i = 1, \ldots, d \tag{II.18}$$

which in matrix form is (II.14).


Weighted Least Squares 


In (II.7) the different observations are given equal weight in the criterion. Sometimes
there is occasion to consider a weighted criterion

$$V_N(\theta) = \sum_{t=1}^{N} \alpha_t [y(t) - \varphi^T(t)\theta]^2 \tag{II.19}$$
The reason for this could be twofold.


1. The observations $y$ could be of varying reliability. Some observations could,
for example, be subject to more disturbances and should therefore be down-weighted. (II.20)


2. The observations could be of varying relevance. It is perhaps not believed that
a linear model holds over all ranges of $\varphi$. An observation corresponding to
$\varphi$ in such a questionable region, even if accurate, should therefore carry less
weight. (II.21)


With the diagonal matrix

$$Q_N = \begin{bmatrix} \alpha_1 & & 0 \\ & \ddots & \\ 0 & & \alpha_N \end{bmatrix} \tag{II.22}$$

the criterion (II.19) can be written

$$V_N(\theta) = (Y_N - \Phi_N \theta)^T Q_N (Y_N - \Phi_N \theta) \triangleq |Y_N - \Phi_N \theta|^2_{Q_N} \tag{II.23}$$

It is immediate to verify that the minimizing element is given by

$$\hat{\theta}_N = [\Phi_N^T Q_N \Phi_N]^{-1} \Phi_N^T Q_N Y_N = \left[ \sum_{t=1}^{N} \alpha_t \varphi(t)\varphi^T(t) \right]^{-1} \sum_{t=1}^{N} \alpha_t \varphi(t) y(t) \tag{II.24}$$
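The effect of the weights can be sketched in a few lines of Python. The function below evaluates the weighted estimate from the two weighted sums; the names `solve` and `weighted_least_squares`, the data, and the weight values are illustrative assumptions.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            raise ValueError("weighted normal equations are singular")
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def weighted_least_squares(phis, ys, alphas):
    """Minimize sum_t alpha_t * (y(t) - phi(t)^T theta)^2 over theta."""
    d = len(phis[0])
    A = [[sum(a * p[i] * p[j] for a, p in zip(alphas, phis)) for j in range(d)]
         for i in range(d)]
    b = [sum(a * p[i] * y for a, p, y in zip(alphas, phis, ys)) for i in range(d)]
    return solve(A, b)


# Hypothetical data: a constant level of 1.0 with one unreliable observation (5.0)
phis = [[1.0], [1.0], [1.0], [1.0]]
ys = [1.0, 1.0, 1.0, 5.0]

theta_equal = weighted_least_squares(phis, ys, [1.0, 1.0, 1.0, 1.0])
theta_down = weighted_least_squares(phis, ys, [1.0, 1.0, 1.0, 0.0])
print(theta_equal, theta_down)
```

With equal weights the unreliable observation pulls the estimate away from the level supported by the other three points; giving it zero weight removes its influence entirely. Intermediate weights interpolate between these two extremes.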


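There could also be reason to use the criterion (II.23) for a general, symmetric
positive definite $Q_N$. The former part of (II.24) then still holds. To interpret what is
going on in terms of the original measurements, it is convenient to factorize $Q_N$:

$$Q_N = L_N^T D_N L_N \tag{II.25}$$

with $L_N$ as a lower triangular matrix with 1’s along the diagonal:

$$L_N = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ \ell_{21} & 1 & 0 & \cdots & 0 \\ \ell_{31} & \ell_{32} & 1 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ \ell_{N1} & \ell_{N2} & \cdots & & 1 \end{bmatrix} \tag{II.26}$$

and $D_N$ a diagonal matrix as in (II.22). Then (II.23) takes the form

$$V_N(\theta) = |\tilde{Y}_N - \tilde{\Phi}_N \theta|^2_{D_N} \tag{II.27}$$

$$\tilde{Y}_N = L_N Y_N, \qquad \tilde{\Phi}_N = L_N \Phi_N \tag{II.28}$$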
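The elements of these matrices are

$$\tilde{y}(t) = \sum_{k=0}^{t-1} \ell_{t,t-k}\, y(t - k) \tag{II.29a}$$

$$\tilde{\varphi}(t) = \sum_{k=0}^{t-1} \ell_{t,t-k}\, \varphi(t - k) \tag{II.29b}$$

We thus have

$$V_N(\theta) = \sum_{t=1}^{N} \alpha_t [\tilde{y}(t) - \tilde{\varphi}^T(t)\theta]^2 \tag{II.30a}$$

$$\hat{\theta}_N = \left[ \sum_{t=1}^{N} \alpha_t \tilde{\varphi}(t)\tilde{\varphi}^T(t) \right]^{-1} \sum_{t=1}^{N} \alpha_t \tilde{\varphi}(t)\tilde{y}(t) \tag{II.30b}$$

The effect of the general norm $Q_N$ in (II.23) is consequently that the original observations
have been filtered by the filter (II.25) to (II.29).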


Residuals and Prediction Errors 
The difference

$$\varepsilon(t, \theta) = y(t) - \varphi^T(t)\theta \tag{II.31}$$

is the error associated with the value $\theta$. We shall call this error the prediction error
corresponding to $\theta$. The vector of prediction errors is

$$E_N(\theta) = \begin{bmatrix} \varepsilon(1, \theta) \\ \vdots \\ \varepsilon(N, \theta) \end{bmatrix} \tag{II.32}$$

and the criteria (II.7) and (II.23) are just different quadratic norms of this vector.
Norms $Q_N$ that are not diagonal correspond to sums of squares of filtered prediction
errors analogously to (II.29).
We shall call

$$\hat{\varepsilon}_N(t) = \varepsilon(t, \hat{\theta}_N)$$

the residuals (“leftovers”) associated with the model $\hat{\theta}_N$.
Consider now for simplicity the case $Q_N = I$. Denote the residual vector

$$\hat{E}_N = E_N(\hat{\theta}_N) \tag{II.33}$$

and the predicted output

$$\hat{Y}_N = \Phi_N \hat{\theta}_N \tag{II.34}$$
From the geometrical interpretation we know that $\hat{E}_N$ and $\hat{Y}_N$ are orthogonal.
Hence

$$|Y_N|^2 = (\hat{Y}_N + \hat{E}_N)^T (\hat{Y}_N + \hat{E}_N) = |\hat{Y}_N|^2 + |\hat{E}_N|^2 \tag{II.35}$$

which also can be written

$$\sum_{t=1}^{N} y^2(t) = \sum_{t=1}^{N} \hat{y}_N^2(t) + \sum_{t=1}^{N} \hat{\varepsilon}_N^2(t) \tag{II.36}$$

which shows how the sum of squared observations splits into predictions

$$\hat{y}_N(t) = \varphi^T(t)\hat{\theta}_N$$

and residuals

$$\hat{\varepsilon}_N(t) = \varepsilon(t, \hat{\theta}_N) \tag{II.37}$$

The ideal situation is when the predicted outputs $\hat{y}_N$ are capable of explaining a
major part of the actual output. The ratio

$$R_y^2 = \frac{\sum_{t=1}^{N} \hat{y}_N^2(t)}{\sum_{t=1}^{N} y^2(t)} = 1 - \frac{\sum_{t=1}^{N} \hat{\varepsilon}_N^2(t)}{\sum_{t=1}^{N} y^2(t)} \tag{II.38}$$

measures the proportion of the total variation of $y$ that is explained by the regression.
It is known as the multiple correlation coefficient (squared) and is often expressed
in percent. Sometimes the mean value of $y$ is subtracted from $\hat{y}$ and $y$ before calculating
$R_y$.
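The split of the sum of squared observations into a predicted part and a residual part can be checked numerically. The sketch below uses a single regressor so that the least-squares estimate has a closed form; the data values and variable names are made up for illustration.

```python
# Hypothetical data: one regressor phi(t) and an output y(t)
phi = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.8]

# Scalar least-squares estimate: sum(phi*y) / sum(phi^2)
theta_hat = sum(p * yi for p, yi in zip(phi, y)) / sum(p * p for p in phi)

y_hat = [theta_hat * p for p in phi]           # predicted outputs
resid = [yi - yh for yi, yh in zip(y, y_hat)]  # residuals

total = sum(yi * yi for yi in y)
explained = sum(yh * yh for yh in y_hat)
unexplained = sum(e * e for e in resid)

# Proportion of the (uncentered) variation explained by the regression
r_squared = explained / total
print(total, explained + unexplained, r_squared)
```

The first two printed numbers agree up to rounding because the residual vector is orthogonal to the predicted output. This version does not subtract the mean of y first; doing so corresponds to the centered variant mentioned in the text.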


Quality of the Parameter Estimate 


To investigate what properties the estimate $\hat{\theta}_N$ may have, let us assume that the
actual measurements $y(t)$, $t = 1, \ldots, N$, can be described by

$$y(t) = \varphi^T(t)\theta_0 + w_0(t) \tag{II.39}$$

where $\{w_0(t)\}$ is some disturbance or error sequence of yet unspecified nature. If
this sequence has some “nice” (to be specified later) properties, it is natural to call
$\theta_0$ “the true parameter.”
If we denote

$$W_N = \begin{bmatrix} w_0(1) \\ \vdots \\ w_0(N) \end{bmatrix} \tag{II.40}$$

we may write (II.39) as

$$Y_N = \Phi_N \theta_0 + W_N \tag{II.41}$$
Inserting this into (II.24) gives

$$\hat{\theta}_N = [\Phi_N^T Q_N \Phi_N]^{-1} \Phi_N^T Q_N [\Phi_N \theta_0 + W_N] = \theta_0 + \tilde{\theta}_N \tag{II.42}$$

where

$$\tilde{\theta}_N = [\Phi_N^T Q_N \Phi_N]^{-1} \Phi_N^T Q_N W_N$$

which in the case $Q_N = (1/N) \cdot I$ also reads

$$\tilde{\theta}_N = \left[ \frac{1}{N} \sum_{t=1}^{N} \varphi(t)\varphi^T(t) \right]^{-1} \frac{1}{N} \sum_{t=1}^{N} \varphi(t) w_0(t) \tag{II.43}$$

This expression for the parameter error is of purely algebraic nature and holds for
all sequences $\{w_0(t)\}$. If $\varphi$ and $w_0$ are quasistationary, we see that, as $N$ tends to
infinity, $\tilde{\theta}_N$ tends to

$$\tilde{\theta}^* = [R_{\varphi}(0)]^{-1} R_{\varphi w}(0) \tag{II.44}$$

with the notation (2.62). If $R_{\varphi}(0)$ is invertible [which corresponds to an assumption
that the sequence $\varphi(t)$ has full rank] and $R_{\varphi w}(0)$ is zero (which corresponds to a
certain “independence” between the regressors and the disturbance), then $\hat{\theta}_N$ will
tend to the true value $\theta_0$ when more observations become available.

To be able to tell more about the properties of $\hat{\theta}_N$, it is natural to create a
probabilistic framework for the disturbance sequence. This will be done in the next
section.


II.2 STATISTICAL PROPERTIES OF THE LEAST-SQUARES ESTIMATE


A Probabilistic Setup 


To achieve further results on the properties of the LSE, we shall introduce more 
specific assumptions about the generation of the observations y. Typical assumptions 
follow: 


• The sequence of regressors $\{\varphi(t)\}$ is a deterministic sequence. (II.45)


• $y(t) = \varphi^T(t)\theta_0 + w_0(t)$, where $\{w_0(t)\}$ is a sequence of independent random
variables with zero mean values and variances $\lambda_0$. (II.46)


Let it immediately be said that assumption (II.45) is too restrictive for most
applications to system identification, since then the regressors typically contain past
outputs. This means that the analysis will not be applicable to system identification
methods. However, (II.45) greatly simplifies the analysis, at the same time as the
results are archetypal for what holds also in the general case. The contents of this
section can therefore be read as a simple preview of the material of Chapters 8
and 9.
In this section we shall relax assumption (II.46) somewhat. We shall allow more
general disturbances $\{w_0(t)\}$ and also, occasionally, that the true regression function
is not compatible with the given linear model. We thus have the description

$$y(t) = g_0(\varphi(t)) + w_0(t) \tag{II.47a}$$

where

$$\varphi(t) \text{ is deterministic} \tag{II.47b}$$

$$E w_0(t) = 0 \tag{II.47c}$$

$$E w_0(t) w_0(s) = r_{ts} \tag{II.47d}$$

In matrix form, with (II.11), (II.12), (II.40), and

$$G_0(\Phi_N) = \begin{bmatrix} g_0(\varphi(1)) \\ \vdots \\ g_0(\varphi(N)) \end{bmatrix} \tag{II.48}$$

we have

$$Y_N = G_0(\Phi_N) + W_N \tag{II.49a}$$

$$E W_N = 0 \tag{II.49b}$$

$$E W_N W_N^T = R_N \tag{II.49c}$$

We shall frequently specialize to

$$G_0(\Phi_N) = \Phi_N \theta_0 \tag{II.50}$$

and

$$R_N = \lambda_0 \cdot I \tag{II.51}$$


Convergence and Consistency 


Consider the special case (II.10) (i.e., $Q_N = (1/N) I$). Then if (II.50) holds we can
write, as in (II.43),

$$\hat{\theta}_N - \theta_0 = \left[ \frac{1}{N} \sum_{t=1}^{N} \varphi(t)\varphi^T(t) \right]^{-1} \frac{1}{N} \sum_{t=1}^{N} \varphi(t) w_0(t) \tag{II.52}$$

The first sum consists of deterministic variables. Suppose that it converges to an
invertible matrix $R_{\varphi}(0)$:

$$\frac{1}{N} \sum_{t=1}^{N} \varphi(t)\varphi^T(t) \to R_{\varphi}(0) \text{ as } N \to \infty, \qquad R_{\varphi}(0) \text{ invertible} \tag{II.53}$$

The second sum consists of random variables with zero mean values under assumption
(II.47b, c, d). Various “strong laws of large numbers” describe when such sums
converge to zero with probability 1. This will depend on the properties of $\{w_0(t)\}$.
See, for example, Chung (1974) or Hall and Heyde (1980) for thorough treatments of
such results. Our Theorem 2.3 shows that

$$\frac{1}{N} \sum_{t=1}^{N} \varphi(t) w_0(t) \to 0, \quad \text{w.p. 1 as } N \to \infty \tag{II.54}$$

if $\{w_0(t)\}$ can be described as filtered white noise as in (2.88) and $\{\varphi(t)\}$ is a bounded
sequence. When (II.53) and (II.54) hold, we have

$$\hat{\theta}_N \to \theta_0, \quad \text{w.p. 1 as } N \to \infty \tag{II.55}$$

This means that $\hat{\theta}_N$ is a strongly consistent estimate of $\theta_0$.


Bias and Variance 


Consider the general weighted LSE

$$\hat{\theta}_N = [\Phi_N^T Q_N \Phi_N]^{-1} \Phi_N^T Q_N Y_N \tag{II.56}$$

Since $\Phi_N$ and $Q_N$ are deterministic, it is easy to calculate the expectation of (II.56):

$$E\hat{\theta}_N = \theta^* = [\Phi_N^T Q_N \Phi_N]^{-1} \Phi_N^T Q_N G_0(\Phi_N) \tag{II.57}$$

When (II.50) holds, we have

$$E\hat{\theta}_N = \theta_0 \tag{II.58}$$

which means that the estimate $\hat{\theta}_N$ is unbiased in case a true description (II.50) is
available.
For the parameter difference, we obtain from (II.56), (II.57), and (II.49a)

$$\tilde{\theta}_N = \hat{\theta}_N - E\hat{\theta}_N = [\Phi_N^T Q_N \Phi_N]^{-1} \Phi_N^T Q_N W_N \tag{II.59}$$

which gives the covariance matrix

$$\operatorname{Cov} \hat{\theta}_N = P_N = E\tilde{\theta}_N \tilde{\theta}_N^T = [\Phi_N^T Q_N \Phi_N]^{-1} \Phi_N^T Q_N R_N Q_N \Phi_N [\Phi_N^T Q_N \Phi_N]^{-1} \tag{II.60}$$

Notice that this holds regardless of the form of $G_0(\Phi_N)$.
For the nonweighted LSE [$Q_N = (1/N) \cdot I$] and independent disturbances,
case (II.51), we find that

$$P_N = \lambda_0 [\Phi_N^T \Phi_N]^{-1} = \lambda_0 \left[ \sum_{t=1}^{N} \varphi(t)\varphi^T(t) \right]^{-1} \tag{II.61}$$

The covariance matrix of $\hat{\theta}_N$ is thus determined by the residuals’ variance $\lambda_0$ and
properties of the regressors. When the $\varphi(t)$ are open to manipulation during the data
collection, it is an important experiment design issue to make the inverse in (II.61)
“small,” subject to constraints that may be at hand.

Note that computation of $P_N$ requires knowledge of $\lambda_0$. Since this may not be
known to the user, it is important to estimate it from data.


Estimating the Noise Variance 


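We have the following result:


Lemma II.1. Let the criterion be given by (II.7) and suppose that (II.49) to (II.51)
hold. Then

$$\hat{\lambda}_N = \frac{N}{N - d} V_N(\hat{\theta}_N) = \frac{1}{N - d} \sum_{t=1}^{N} [y(t) - \varphi^T(t)\hat{\theta}_N]^2 \tag{II.62}$$

is an unbiased estimate of $\lambda_0$. (Recall that $d = \dim \theta$.)

Proof. We have

$$Y_N - \Phi_N \hat{\theta}_N = Y_N - \Phi_N [\Phi_N^T \Phi_N]^{-1} \Phi_N^T Y_N = [I - \Phi_N [\Phi_N^T \Phi_N]^{-1} \Phi_N^T][\Phi_N \theta_0 + W_N] = [I - \Phi_N [\Phi_N^T \Phi_N]^{-1} \Phi_N^T] W_N \triangleq F_N W_N \tag{II.63}$$

Note that $F_N^T F_N = F_N$. Hence

$$E|Y_N - \Phi_N \hat{\theta}_N|^2 = E W_N^T F_N^T F_N W_N = E W_N^T F_N W_N = E \operatorname{tr} F_N W_N W_N^T = \lambda_0 \operatorname{tr} F_N$$

Now

$$\operatorname{tr} F_N = \operatorname{tr}\left(I - \Phi_N [\Phi_N^T \Phi_N]^{-1} \Phi_N^T\right) = \operatorname{tr} I - \operatorname{tr} \Phi_N [\Phi_N^T \Phi_N]^{-1} \Phi_N^T = \operatorname{tr} I - \operatorname{tr} [\Phi_N^T \Phi_N]^{-1} \Phi_N^T \Phi_N = N - d \tag{II.64}$$

Recall that $F_N$ is an $N \times N$ matrix and $\Phi_N^T \Phi_N$ is a $d \times d$ matrix. Here we made
use of $\operatorname{tr} AB = \operatorname{tr} BA$.
Consequently,

$$\frac{1}{N - d} E|Y_N - \Phi_N \hat{\theta}_N|^2 = \lambda_0$$

and the lemma is proved.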
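The two facts about the matrix F_N used in the proof, that it is idempotent and that its trace equals N - d, can be verified numerically for a small case. The sketch below assumes a hypothetical regressor matrix with N = 5 rows of the form [1, t], so that d = 2 and the 2 x 2 Gram matrix can be inverted in closed form.

```python
N, d = 5, 2
# Hypothetical regressor matrix: row t is [1, t]
Phi = [[1.0, float(t)] for t in range(1, N + 1)]

# Gram matrix Phi^T Phi = [[a, b], [b, c]] and its closed-form inverse
a = sum(r[0] * r[0] for r in Phi)
b = sum(r[0] * r[1] for r in Phi)
c = sum(r[1] * r[1] for r in Phi)
det = a * c - b * b
G_inv = [[c / det, -b / det], [-b / det, a / det]]


def projection_entry(i, j):
    """Entry (i, j) of Phi (Phi^T Phi)^{-1} Phi^T."""
    return sum(Phi[i][k] * G_inv[k][m] * Phi[j][m]
               for k in range(d) for m in range(d))


# F = I - Phi (Phi^T Phi)^{-1} Phi^T
F = [[(1.0 if i == j else 0.0) - projection_entry(i, j) for j in range(N)]
     for i in range(N)]

trace_F = sum(F[i][i] for i in range(N))

# Check idempotency: F F should equal F
FF = [[sum(F[i][k] * F[k][j] for k in range(N)) for j in range(N)]
      for i in range(N)]
max_dev = max(abs(FF[i][j] - F[i][j]) for i in range(N) for j in range(N))
print(trace_F, max_dev)
```

The trace comes out as N - d up to rounding, which is why the residual sum of squares must be divided by N - d rather than N to give an unbiased noise variance estimate.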
Notice that both assumptions (II.50) and (II.51) are important for this result.
It is not feasible to estimate a general covariance matrix $R_N$ from data. Also, if
the true description is of the general form (II.47a), the estimate $\hat{\lambda}_N$ will contain a
contribution

$$\frac{1}{N - d} \sum_{t=1}^{N} [g_0(\varphi(t)) - \varphi^T(t)\theta^*]^2$$

that will not tend to zero. This makes $\hat{\lambda}_N$ systematically larger than $\lambda_0$, and the noise
level is overestimated. This points to a very typical dilemma in many applications:
to distinguish noise effects from bias effects.


Minimizing the Variance 


From (II.60) we know that the variance of $\hat{\theta}_N$ depends on the choice of norm $Q_N$.
One might ask what the best choice of $Q$ is in the sense that it gives the smallest
covariance matrix. This question is answered by the following lemma:


Lemma II.2. Let $R_N$ be a positive definite matrix and define

$$P_N(Q) = [\Phi_N^T Q \Phi_N]^{-1} \Phi_N^T Q R_N Q \Phi_N [\Phi_N^T Q \Phi_N]^{-1}$$

Then for all symmetric, positive semidefinite $Q$,

$$P_N(R_N^{-1}) \le P_N(Q)$$

Proof. The matrix

$$\begin{bmatrix} \Phi_N^T R_N^{-1} \\ \Phi_N^T Q \end{bmatrix} R_N \begin{bmatrix} R_N^{-1} \Phi_N & Q \Phi_N \end{bmatrix} = \begin{bmatrix} \Phi_N^T R_N^{-1} \Phi_N & \Phi_N^T Q \Phi_N \\ \Phi_N^T Q \Phi_N & \Phi_N^T Q R_N Q \Phi_N \end{bmatrix}$$

is positive semidefinite by construction. Hence, according to Problem 7D.8,

$$\Phi_N^T R_N^{-1} \Phi_N \ge \Phi_N^T Q \Phi_N [\Phi_N^T Q R_N Q \Phi_N]^{-1} \Phi_N^T Q \Phi_N$$

Inverting both sides proves the lemma.


The estimate (II.24) obtained for $Q = R_N^{-1}$ is also known as the Markov
estimate or the best linear unbiased estimate (BLUE). Notice that it requires
knowledge of the covariance matrix $R_N$, which might not be a realistic assumption.

In case the noise terms in (II.47) are independent with different variances,

$$E w_0^2(t) = \lambda_t$$

the lemma tells us that the variance of the estimate is minimized when the criterion
(II.19) is used with

$$\alpha_t = \frac{1}{\lambda_t} \tag{II.65}$$

That is, the observations should be weighted with their inverse variances. This choice
of weights is thus optimal in the sense of (II.20). Notice, though, that when (II.50)
does not hold, the weights $\alpha_t$ will also affect the bias of $\hat{\theta}_N$, and this effect may be in
conflict with (II.65).
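For a single parameter and independent disturbances, the covariance expression for the weighted estimate reduces to a ratio of weighted sums, so the benefit of inverse-variance weighting can be checked directly. The sketch below is an illustration under that scalar assumption; the function name `estimate_variance` and the variance values are hypothetical.

```python
def estimate_variance(phi, lams, alphas):
    """Variance of the weighted LSE for scalar theta and independent noise.

    Scalar form of [sum a*phi^2]^-1 * sum a^2*phi^2*lambda * [sum a*phi^2]^-1.
    """
    num = sum(a * a * p * p * lam for a, p, lam in zip(alphas, phi, lams))
    den = sum(a * p * p for a, p in zip(alphas, phi))
    return num / den ** 2


phi = [1.0, 1.0, 1.0, 1.0]    # constant regressor (estimating a level)
lams = [1.0, 1.0, 4.0, 9.0]   # assumed noise variances of the four observations

var_equal = estimate_variance(phi, lams, [1.0] * len(phi))
var_blue = estimate_variance(phi, lams, [1.0 / lam for lam in lams])
print(var_equal, var_blue)
```

With inverse-variance weights the expression collapses to one over the sum of phi squared divided by lambda, which is never larger than the equal-weight value. In practice the individual variances are rarely known exactly, so the weights are themselves approximations.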


Distribution of the Estimates 


The estimates $\hat{\theta}_N$ of $\theta_0$ and $\hat{\lambda}_N$ of $\lambda_0$ are random variables, since they are constructed
from the random variables $\{y(t)\}$. It is thus of interest to determine their distribution.
In order to do that we shall introduce the following additional assumption:


The vector $W_N$ of disturbance terms has a Gaussian distribution. (II.66)


This assumption implies that $Y_N$ will also have a Gaussian distribution with mean
value $G_0(\Phi_N)$ and covariance $R_N$:

$$Y_N \in N(G_0(\Phi_N), R_N) \tag{II.67}$$

Since $\hat{\theta}_N$ in (II.56) is a linear combination of $Y_N$, $\hat{\theta}_N$ will also be Gaussian:

$$\hat{\theta}_N \in N(\theta^*, P_N) \tag{II.68}$$

where $\theta^*$ and $P_N$ are given by (II.57) and (II.60), respectively. This answers the
question of the distribution of $\hat{\theta}_N$ under assumption (II.66).

Even when the observations are not normally distributed, it is often the case
that the distribution of $\hat{\theta}_N$ approaches the normal distribution as $N$ increases to
infinity. This follows from application of central limit theorems (CLTs) to the sum
of random variables that constitutes the estimates. See Problem II.3.

The distribution of $\hat{\lambda}_N$ in (II.62) is somewhat more technical to determine. Let
us again consider the special case (II.50) and (II.51) and let $V_N(\theta)$ be defined by
(II.7). Then

$$N \cdot V_N(\theta_0) = \sum_{t=1}^{N} [y(t) - \varphi^T(t)\theta_0]^2 = \sum_{t=1}^{N} w_0^2(t) \tag{II.69}$$

Now, under (II.51) and (II.66), $\{w_0(t)\}$ is a sequence of independent Gaussian
random variables with variances $\lambda_0$. Hence

$$\frac{N}{\lambda_0} V_N(\theta_0) \in \chi^2(N) \tag{II.70}$$

That is, the left side is $\chi^2$-distributed with $N$ degrees of freedom, by definition of
the $\chi^2$-distribution. When $\theta_0$ is replaced by $\hat{\theta}_N$, we have a related result:

Lemma II.3. Assume that (II.49) to (II.51) and (II.66) hold. Then

$$\frac{N}{\lambda_0} V_N(\hat{\theta}_N) = \frac{1}{\lambda_0} \sum_{t=1}^{N} [y(t) - \varphi^T(t)\hat{\theta}_N]^2 \in \chi^2(N - d) \tag{II.71}$$
Proof. Compare with the proof of Lemma II.1. Let

$$F_N = I - \Phi_N [\Phi_N^T \Phi_N]^{-1} \Phi_N^T$$

which is a symmetric $N \times N$ matrix. Then, as in (II.63), $F_N = F_N F_N$, which implies
that all eigenvalues of $F_N$ are either 0 or 1. Since $\operatorname{tr} F_N = N - d$ by (II.64), we find
that there are $N - d$ eigenvalues that are 1 and $d$ that are 0. Since $F_N$ is symmetric,
it can be diagonalized by an orthogonal matrix $U_N$:

$$U_N F_N U_N^T = D_N$$

where $D_N$ consists of $N - d$ ones and $d$ zeros along the diagonal. As in the proof
of Lemma II.1,

$$\frac{N}{\lambda_0} V_N(\hat{\theta}_N) = \frac{1}{\lambda_0} |Y_N - \Phi_N \hat{\theta}_N|^2 = \frac{1}{\lambda_0} |F_N W_N|^2 = \frac{1}{\lambda_0} W_N^T F_N^T F_N W_N = \frac{1}{\lambda_0} W_N^T F_N W_N = \frac{1}{\lambda_0} W_N^T U_N^T D_N U_N W_N = \frac{1}{\lambda_0} \sum_{t=1}^{N-d} \bar{w}_N^2(t)$$

where $\bar{w}_N(t)$ are the components of the vector $U_N W_N$ (ordered so that the ones in
$D_N$ come first). But since the components
of $W_N$ are independent and normal, with variance $\lambda_0$, so are those of $U_N W_N$, the
matrix $U_N$ being orthogonal. This proves the lemma.


A consequence of the lemma is that the estimate $\hat{\lambda}_N$ of $\lambda_0$ obeys

$$(N - d) \frac{\hat{\lambda}_N}{\lambda_0} \in \chi^2(N - d) \tag{II.72}$$

which implies that

$$E(\hat{\lambda}_N - \lambda_0)^2 = \frac{2\lambda_0^2}{N - d} \tag{II.73}$$


Confidence Intervals 
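In the case (II.50) where a true parameter $\theta_0$ exists, the distribution result (II.68)
tells us how the deviation between $\hat{\theta}_N$ and $\theta_0$ is distributed:

$$\hat{\theta}_N - \theta_0 \in N(0, P_N) \tag{II.74}$$

For the $i$th component, we thus have

$$\hat{\theta}_N^{(i)} - \theta_0^{(i)} \in N(0, P_N^{(ii)})$$

or

$$\frac{\hat{\theta}_N^{(i)} - \theta_0^{(i)}}{\sqrt{P_N^{(ii)}}} \in N(0, 1) \tag{II.75}$$

Here $P_N^{(ii)}$ indicates the $i$th diagonal element of $P_N$. Hence the probability that
$\hat{\theta}_N^{(i)}$ deviates from $\theta_0^{(i)}$ with more than $\alpha \cdot \sqrt{P_N^{(ii)}}$ is the $(1 - \alpha)$-level of the normal
distribution, which is available in standard statistical tables.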
In fact, (II.68) tells us more than how each component of $\hat{\theta}_N$ is distributed.
Since $P_N$ is the covariance matrix of the joint distribution of the vector $\hat{\theta}_N$, we also
have useful information about the covariance and correlation between the different
components of $\hat{\theta}_N$. This is most easily utilized as follows. From (II.74) we have

$$(\hat{\theta}_N - \theta_0)^T P_N^{-1} (\hat{\theta}_N - \theta_0) \in \chi^2(d) \tag{II.76}$$

by a direct application of the definition of the $\chi^2$-distribution. The probability that

$$|\hat{\theta}_N - \theta_0|^2_{P_N^{-1}} = (\hat{\theta}_N - \theta_0)^T P_N^{-1} (\hat{\theta}_N - \theta_0) \ge \alpha \tag{II.77}$$

is thus $\chi^2_{\alpha}(d)$, the $\alpha$-level of the $\chi^2(d)$-distribution. The expressions (II.77) define
ellipsoids in $R^d$, whose shape is determined by $P_N$. See Figure II.2.

Figure II.2 Shaded area: $|\hat{\theta}_N - \theta_0|_{P_N^{-1}} \le$ constant.

In case $Q_N = I$ and $R_N = \lambda_0 I$, the expression for $P_N$ is (II.61):

$$P_N = \lambda_0 [\Phi_N^T \Phi_N]^{-1} \tag{II.78}$$
While the matrix $[\Phi_N^T \Phi_N]^{-1}$ is known to the user, $\lambda_0$ is typically not, which impairs
the use of the results (II.75) and (II.76). An immediate approach would be to replace
$\lambda_0$ by the estimate $\hat{\lambda}_N$. In view of (II.73), this is a good approximation for large $N$,
and the use of the preceding confidence limits is still reasonable. With Lemma II.3,
a more exact result can, however, be achieved. We have

$$\frac{1}{d \hat{\lambda}_N} (\hat{\theta}_N - \theta_0)^T [\Phi_N^T \Phi_N] (\hat{\theta}_N - \theta_0) = \frac{\frac{1}{d} (\hat{\theta}_N - \theta_0)^T P_N^{-1} (\hat{\theta}_N - \theta_0)}{\hat{\lambda}_N / \lambda_0} \in F(d, N - d) \tag{II.79}$$

where the last step follows from the definition of the $F$-distribution as the distribution
of the ratio of two $\chi^2$-distributed variables. The leftmost expression is available
to the user, and thus the probability that $\theta_0$ deviates from $\hat{\theta}_N$ in such a way that

$$\frac{1}{d \hat{\lambda}_N} (\hat{\theta}_N - \theta_0)^T [\Phi_N^T \Phi_N] (\hat{\theta}_N - \theta_0) \ge \alpha \tag{II.80}$$

can be computed as the $\alpha$-level of the $F(d, N - d)$ distribution. Note that, as
$N \to \infty$, $d \cdot F(d, N - d)$ approaches $\chi^2(d)$. Hence using $\hat{\lambda}_N$ for $\lambda_0$ in $P_N$ in (II.77) is
often reasonable. Similarly, using $\hat{\lambda}_N$ for $\lambda_0$ in (II.75) replaces the normal distribution
by Student's t-distribution.
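A confidence interval for a single parameter can be sketched as follows, under the simplifying assumption that the noise variance is known, so that the normal distribution applies. The data, the value of `lambda0`, and the 95% level are illustrative choices; `statistics.NormalDist` from the Python standard library supplies the normal quantile.

```python
import math
from statistics import NormalDist

# Hypothetical data for a one-parameter model y(t) = phi(t) * theta0 + noise
phi = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.8]
lambda0 = 0.04  # noise variance, assumed known for this sketch

s = sum(p * p for p in phi)
theta_hat = sum(p * yi for p, yi in zip(phi, y)) / s

# Scalar case of the covariance expression: P = lambda0 / sum(phi^2)
P = lambda0 / s

# Two-sided 95% interval: the 0.975 quantile of the standard normal
z = NormalDist().inv_cdf(0.975)
half_width = z * math.sqrt(P)
interval = (theta_hat - half_width, theta_hat + half_width)
print(theta_hat, interval)
```

When the noise variance is estimated from the residuals instead, the normal quantile would be replaced by the corresponding quantile of Student's t-distribution with N - d degrees of freedom, which is not available in the standard library and is therefore left out here.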


II.3 SOME FURTHER TOPICS IN LEAST-SQUARES ESTIMATION


Selecting Regressors 


A major problem when applying regression analysis to practical problems is to settle
for a good set of regressor variables. This means that we have to determine which
variables $\varphi_i(t)$ may influence the output $y(t)$ in (II.3). Clearly, this is a very
application-dependent task and requires a good understanding of the process to be described.
The choice of $\varphi_i$ may, however, also be supported by some formal procedures that
are described in the statistical literature.

Selecting the regressors in (II.3) corresponds to the choice of model structure
in a system identification setup. This problem is discussed in more detail in Chapter
16, and we shall only provide a short preview here.

A basic tool is to investigate the sequence of residuals $\{\hat{\varepsilon}_N(t)\}$ defined by
(II.37). If (II.49) to (II.51) hold and $\hat{\theta}_N$ gives a correct description of the process, then
$\hat{\varepsilon}_N(t)$ would equal $w_0(t)$ in (II.46) and would thus be a sequence of independent random variables.
Such a hypothesis can be tested in various ways. Also, the residual sequence should
be uncorrelated with all potential regressors. If it is not, the regressor in question
has something to offer for the prediction of $y(t)$ and it should thus be included in
the regressor set. This may be a useful way of conducting the search for informative
regressors. Daniel and Wood (1980) contains practical advice in this respect.

Another way of determining whether a regressor should be included in the
regressor set is to check whether it leads to a significant reduction of the criterion
function $V_N(\hat{\theta}_N)$. It should be clear that the minimal value of the criterion function
will automatically decrease when a new regressor is added, whether or not it is
actually correlated with the output. This follows since the minimization of $V_N$ is
performed over a larger set. As a simple special case, think of the process (II.49) to
(II.51) with $\theta_0 = 0$. The criterion function

$$V_N(0) = \frac{1}{N} \sum_{t=1}^{N} y^2(t) = \frac{1}{N} \sum_{t=1}^{N} w_0^2(t)$$

then has mean value $\lambda_0$. If, however, we minimize over a $d$-dimensional parameter
vector $\theta$, we obtain

$$V_N(\hat{\theta}_N) = \frac{1}{N} \sum_{t=1}^{N} [y(t) - \varphi^T(t)\hat{\theta}_N]^2$$

which according to Lemma II.1 has mean value $[(N - d)/N]\lambda_0$. We thus see an
average decrease of the criterion function of $(d/N)\lambda_0$, despite the fact that there
was nothing in $y(t)$ to explain by the regressors! The improved fit is spurious and
can be seen as an overfit to the particular realization of $\{w_0(t)\}$.
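The spurious decrease of the criterion can be illustrated by a small Monte Carlo experiment. The sketch below generates pure noise with unit variance, fits one irrelevant regressor (so d = 1), and averages the criterion before and after the fit. The regressor shape, the seed, and the number of runs are arbitrary choices made for this illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the experiment is repeatable

N, runs = 20, 20000
# One deterministic regressor that has nothing to do with the output
phi = [math.cos(t) for t in range(1, N + 1)]
s_phi = sum(p * p for p in phi)

v0_sum = 0.0     # accumulates the criterion evaluated at theta = 0
v_hat_sum = 0.0  # accumulates the criterion evaluated at the LSE

for _ in range(runs):
    # The output is pure white Gaussian noise (true parameter is zero)
    y = [random.gauss(0.0, 1.0) for _ in range(N)]
    theta_hat = sum(p * yi for p, yi in zip(phi, y)) / s_phi
    v0_sum += sum(yi * yi for yi in y) / N
    v_hat_sum += sum((yi - theta_hat * p) ** 2 for p, yi in zip(phi, y)) / N

mean_v0 = v0_sum / runs
mean_v_hat = v_hat_sum / runs
print(mean_v0, mean_v_hat)
```

The average criterion at the true parameter stays close to the noise variance, while the average minimized criterion is smaller by roughly the fraction d/N, even though the regressor carries no information about the output.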


Consequently. the observed decrease in Vy when new regressors are added 
must be matched against this overfit. Several ways to do this are discussed in Chapter 


16. For the present setup we have the following formal result: 


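Lemma 11.4. Suppose that the data can be described by 

$$y(t) = \varphi^T(t)\theta_0 + e_0(t) \tag{11.81}$$

where $\{e_0(t)\}$ is white Gaussian noise with variance $\lambda_0$. Let 

$$V_N^{(1)} = \min_{\theta} \frac{1}{N}\sum_{t=1}^{N}\left[y(t) - \varphi^T(t)\theta\right]^2 \tag{11.82}$$

Consider another regression with some added regressors 

$$y(t) = \varphi^T(t)\theta + \eta^T(t)\theta_\eta$$

and let 

$$V_N^{(2)} = \min_{\theta,\,\theta_\eta} \frac{1}{N}\sum_{t=1}^{N}\left[y(t) - \varphi^T(t)\theta - \eta^T(t)\theta_\eta\right]^2 \tag{11.83}$$

Let $d = \dim\theta$, $r = \dim\eta$. Then 

(a) $N V_N^{(2)}/\lambda_0 \in \chi^2(N - d - r)$ 

(b) $N\left(V_N^{(1)} - V_N^{(2)}\right)/\lambda_0 \in \chi^2(r)$ 

(c) $V_N^{(2)}$ and $V_N^{(1)} - V_N^{(2)}$ are independent 

(d) $$t(d, r, N) = \frac{N - d - r}{r}\cdot\frac{V_N^{(1)} - V_N^{(2)}}{V_N^{(2)}} \in F(r, N - d - r) \tag{11.84}$$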


Proof. (a) is a restatement of Lemma II.2 and (d) is a consequence of (a), (b), and 
(c) by the definition of the F-distribution. The proofs of (b) and (c) are analogous 
to the proof of Lemma II.2. 

The lemma implies that, if (11.81) holds so that the inclusion of $\eta$ is “unnecessary,” 
then the normalized decrease in the criterion is distributed according to (11.84). 
If the observed decrease is significantly larger [i.e., if $t(d, r, N) > F_\alpha(r, N - d - r)$], 
then the conclusion should be that the inclusion of $\eta$ is useful and that, consequently, 
(11.81) does not hold. Recall that the F-distribution can often be approximated by 
the $\chi^2$-distribution for large N. 
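
As an illustration of the test, the sketch below computes the statistic t(d, r, N) of (11.84) for simulated data. It assumes NumPy only; the function names, the parameter values, and the 95% level are hypothetical choices. The critical value is estimated by sampling from the F-distribution, which is an approximation; an exact quantile function from a statistics library would serve the same purpose.

```python
import numpy as np

def min_criterion(X, y):
    """Minimal value of (1/N) * sum of squared residuals for regressor matrix X."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((y - X @ theta) ** 2)

def added_regressor_statistic(Phi, Eta, y):
    """t(d, r, N) of (11.84) for original regressors Phi and added regressors Eta."""
    N, d = Phi.shape
    r = Eta.shape[1]
    v1 = min_criterion(Phi, y)                        # original regressors only
    v2 = min_criterion(np.hstack([Phi, Eta]), y)      # with the added regressors
    return (N - d - r) / r * (v1 - v2) / v2

rng = np.random.default_rng(1)
N, d, r = 200, 3, 2
Phi = rng.standard_normal((N, d))
Eta = rng.standard_normal((N, r))
e0 = rng.standard_normal(N)
theta0 = np.array([1.0, -0.5, 0.25])   # illustrative true parameters

# Empirical 95% level of F(r, N-d-r), estimated by sampling (an approximation).
f_crit = np.quantile(rng.f(r, N - d - r, size=200_000), 0.95)

# Case 1: the data follow (11.81), so the added regressors are unnecessary.
t_unnecessary = added_regressor_statistic(Phi, Eta, Phi @ theta0 + e0)
# Case 2: the added regressors actually enter the data (hypothetical coefficients).
y_useful = Phi @ theta0 + Eta @ np.array([0.5, 0.5]) + e0
t_useful = added_regressor_statistic(Phi, Eta, y_useful)
print(f_crit, t_unnecessary, t_useful)
```

In the first case the statistic is a draw from F(r, N - d - r) and exceeds the critical value only with probability about 0.05; in the second case it should exceed it by a wide margin.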


Multiple Regressions 


Sometimes there is occasion to study the simultaneous prediction of several, say p, 
variables. This means that our variable y(t) will be a p-dimensional column vector. 
In systems applications this corresponds to multivariable systems. Much of what has 
been said here about linear regression will hold also for the multivariate case, but 
some algebraic expressions take a slightly different form. 

It is convenient to distinguish between two cases: 


The same set of regressors is used for each component of y(t): Denote the regressors 
by the r-dimensional column vector $\varphi(t)$. Then write the regression as 

$$y(t) = \theta^T\varphi(t) \tag{11.85}$$

where $\theta$ now is an r x p-dimensional matrix with its ith column containing the 
coefficients associated with the ith component of y. The number of estimated parameters 
thus is d = rp. The LS criterion becomes 

$$V_N(\theta) = \frac{1}{N}\sum_{t=1}^{N}\left|y(t) - \theta^T\varphi(t)\right|^2 \tag{11.86}$$

which is minimized by 

$$\hat\theta_N = \left[\frac{1}{N}\sum_{t=1}^{N}\varphi(t)\varphi^T(t)\right]^{-1}\frac{1}{N}\sum_{t=1}^{N}\varphi(t)y^T(t) \tag{11.87}$$


(see Problem 7D.2). 
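
The estimate (11.87) can be sketched in a few lines, assuming NumPy and simulated data; the dimensions, noise level, and variable names are illustrative assumptions. Samples are stored as rows, so `Phi` holds $\varphi^T(t)$ and `Y` holds $y^T(t)$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, r, p = 500, 3, 2
theta0 = rng.standard_normal((r, p))      # column i: coefficients for output i
Phi = rng.standard_normal((N, r))         # row t is phi^T(t)
Y = Phi @ theta0 + 0.1 * rng.standard_normal((N, p))   # row t is y^T(t)

R = Phi.T @ Phi / N     # (1/N) sum phi(t) phi^T(t), an r x r matrix
F = Phi.T @ Y / N       # (1/N) sum phi(t) y^T(t), an r x p matrix
theta_hat = np.linalg.solve(R, F)   # (11.87), solved without forming the inverse

# Each column coincides with the single-output LS estimate for that component.
col0, *_ = np.linalg.lstsq(Phi, Y[:, 0], rcond=None)
print(np.allclose(theta_hat[:, 0], col0))
```

Only one r x r system is solved, with p right-hand sides, which is the computational point made about the structure (11.85).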


Different regressor sets for the different components of y: In case the different 
outputs are associated with different regressor sets, one must introduce a d x p 
matrix $\varphi(t)$ whose ith column contains the regressors associated with output i and 
very possibly several zeros. Then the regression is written 

$$y(t) = \varphi^T(t)\theta \tag{11.88}$$

with $\theta$ as a d-dimensional column vector. The LS criterion becomes 

$$V_N(\theta) = \frac{1}{N}\sum_{t=1}^{N}\left\|y(t) - \varphi^T(t)\theta\right\|^2_{\Lambda^{-1}} = \frac{1}{N}\sum_{t=1}^{N}\left[y(t) - \varphi^T(t)\theta\right]^T\Lambda^{-1}\left[y(t) - \varphi^T(t)\theta\right] \tag{11.89}$$

Here we allowed a general quadratic norm to give different weights to the different 
components of y(t). The element that minimizes (11.89) is 

$$\hat\theta_N = \left[\frac{1}{N}\sum_{t=1}^{N}\varphi(t)\Lambda^{-1}\varphi^T(t)\right]^{-1}\left[\frac{1}{N}\sum_{t=1}^{N}\varphi(t)\Lambda^{-1}y(t)\right] \tag{11.90}$$


(see Problem 7D.2). 
Comparing (11.90) with (11.87), we see the advantage with the special structure 
(11.85): To determine the r x p estimate $\hat\theta_N$ in (11.87), it is sufficient to invert an r x r 
matrix. In (11.90), $\hat\theta_N$ is a pr vector and the matrix inversion involves a pr x pr 
matrix. 
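
The comparison can be made concrete with a small numerical sketch, assuming NumPy, simulated data, and an arbitrarily chosen weighting matrix (all hypothetical). The same regressors are used for every output, so the d x p matrix of (11.88) is block structured with d = pr, and the parameter vector stacks the columns of the r x p matrix of (11.85).

```python
import numpy as np

rng = np.random.default_rng(3)
N, r, p = 300, 2, 2
Phi = rng.standard_normal((N, r))                     # row t is the shared phi^T(t)
Y = Phi @ rng.standard_normal((r, p)) + 0.1 * rng.standard_normal((N, p))
Lam_inv = np.linalg.inv(np.array([[1.0, 0.3], [0.3, 2.0]]))   # assumed weighting

d = r * p
A = np.zeros((d, d))
b = np.zeros(d)
for t in range(N):
    # d x p regressor matrix: column i holds phi(t) for output i, zeros elsewhere.
    phi_big = np.kron(np.eye(p), Phi[t].reshape(-1, 1))
    A += phi_big @ Lam_inv @ phi_big.T / N
    b += phi_big @ Lam_inv @ Y[t] / N
theta_vec = np.linalg.solve(A, b)                     # (11.90): a pr x pr system

theta_mat = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)   # (11.87): an r x r system
print(np.allclose(theta_vec, theta_mat.flatten(order="F")))
```

When every output shares the same regressors, the two estimates agree up to rounding, so the smaller r x r computation is all that is needed; the pr x pr form becomes necessary only when the regressor sets differ.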


Correlation Interpretation and the Instrumental-variable Method 


A useful interpretation of the LS method is as follows. Given the description 

$$y(t) = \varphi^T(t)\theta_0 + w_0(t)$$

multiply both sides by $\varphi(t)$ and sum over t = 1, ..., N. This leads to 

$$\frac{1}{N}\sum_{t=1}^{N}\varphi(t)y(t) = \left[\frac{1}{N}\sum_{t=1}^{N}\varphi(t)\varphi^T(t)\right]\theta_0 + \frac{1}{N}\sum_{t=1}^{N}\varphi(t)w_0(t) \tag{11.91}$$

Provided that the disturbance $\{w_0(t)\}$ and the regression vector $\varphi(t)$ are uncorrelated, 
which means that the rightmost term is small, we find that the LSE 

$$\hat\theta_N = \left[\frac{1}{N}\sum_{t=1}^{N}\varphi(t)\varphi^T(t)\right]^{-1}\frac{1}{N}\sum_{t=1}^{N}\varphi(t)y(t) \tag{11.92}$$

is a reasonable estimate of $\theta_0$. We have thus “correlated out” $\theta_0$ from the noise 
using the regressor sequence. 

In some cases we may expect correlation between the noise and the regressors. 
A natural extension of the preceding correlation idea would then be to use a 
d-dimensional vector sequence $\{\zeta(t)\}$ that is uncorrelated with the noise, but correlated 
with the regressor. Multiplying by $\zeta(t)$ and summing over t leads to 

$$\frac{1}{N}\sum_{t=1}^{N}\zeta(t)y(t) = \left[\frac{1}{N}\sum_{t=1}^{N}\zeta(t)\varphi^T(t)\right]\theta_0 + \frac{1}{N}\sum_{t=1}^{N}\zeta(t)w_0(t) \tag{11.93}$$

If $\{\zeta(t)\}$ has the two above-mentioned properties, we find that 

$$\hat\theta_N = \left[\frac{1}{N}\sum_{t=1}^{N}\zeta(t)\varphi^T(t)\right]^{-1}\frac{1}{N}\sum_{t=1}^{N}\zeta(t)y(t) \tag{11.94}$$

is a reasonable estimate of $\theta_0$. This estimate was introduced by Reiersøl (1941) and is 
discussed in some detail in Kendall and Stuart (1961). It is known as the 
instrumental-variable estimate (IVE), and $\zeta(t)$ is called instrumental variables or the instruments. 
It remains to be discussed how to choose $\zeta(t)$, and that is done in Section 7.6 in 
connection with applications to dynamical systems. 
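
The difference between (11.92) and (11.94) shows up clearly in a scalar simulation. The sketch below assumes NumPy and a deliberately simple, hypothetical construction in which the regressor is the instrument plus a multiple of the disturbance; it says nothing about how instruments are chosen for dynamical systems.

```python
import numpy as np

rng = np.random.default_rng(4)
N, theta0 = 20_000, 2.0
zeta = rng.standard_normal(N)     # instrument: independent of the disturbance
w0 = rng.standard_normal(N)       # disturbance
phi = zeta + 0.8 * w0             # regressor, correlated with the disturbance
y = phi * theta0 + w0

theta_ls = np.sum(phi * y) / np.sum(phi * phi)     # (11.92), scalar case
theta_iv = np.sum(zeta * y) / np.sum(zeta * phi)   # (11.94), scalar case
print(theta_ls, theta_iv)
```

Under these assumptions the LS estimate tends to $\theta_0 + 0.8/1.64 \approx 2.49$ because the rightmost term of (11.91) does not vanish, while the IV estimate stays close to $\theta_0 = 2$.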


Nonlinear Regressions 


The characterizing feature of the linear regression model (11.3) is that the regression 
function is linearly parametrized in $\theta$. Often one has to consider more general 
parametrizations of the regression function: 

$$g(\varphi, \theta) \tag{11.95}$$

We thus obtain a nonlinear regression 

$$y(t) = g(\varphi(t), \theta) \tag{11.96}$$

The weighted least-squares criterion 

$$V_N(\theta) = \frac{1}{N}\sum_{t=1}^{N}\alpha_t\left[y(t) - g(\varphi(t), \theta)\right]^2 \tag{11.97}$$

can still be used as a measure of fit, and the estimate becomes 

$$\hat\theta_N = \arg\min_{\theta} V_N(\theta)$$

just as in (II.7) and (II.8). The important difference is that it may not be possible 
to find explicit expressions for $\hat\theta_N$ as in (11.10), but one has to resort to numerical, 
iterative techniques. 
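
One common iterative technique is a damped Gauss-Newton search. The sketch below assumes NumPy, equal weights in (11.97), and a hypothetical regression function $g(\varphi, \theta) = \theta_1 e^{-\theta_2\varphi}$ chosen only for illustration; the starting point and data are likewise assumptions, and the scheme is one possibility among many.

```python
import numpy as np

def g(phi, theta):
    """Hypothetical regression function g(phi, theta) = theta_1 * exp(-theta_2 * phi)."""
    return theta[0] * np.exp(-theta[1] * phi)

def jacobian(phi, theta):
    """Derivatives of g with respect to theta, one row per sample."""
    e = np.exp(-theta[1] * phi)
    return np.column_stack([e, -theta[0] * phi * e])

def gauss_newton(phi, y, theta, iters=50):
    """Damped Gauss-Newton minimization of (1/N) * sum (y - g)^2 (equal weights)."""
    V = lambda th: np.mean((y - g(phi, th)) ** 2)
    for _ in range(iters):
        resid = y - g(phi, theta)
        # Gauss-Newton direction: LS fit of the residuals on the linearized model.
        step, *_ = np.linalg.lstsq(jacobian(phi, theta), resid, rcond=None)
        mu = 1.0
        while V(theta + mu * step) > V(theta) and mu > 1e-8:
            mu /= 2   # halve the step until the criterion does not increase
        theta = theta + mu * step
    return theta

rng = np.random.default_rng(5)
phi = np.linspace(0.0, 4.0, 200)
theta0 = np.array([2.0, 0.7])                      # illustrative true parameters
y = g(phi, theta0) + 0.01 * rng.standard_normal(phi.size)
theta_hat = gauss_newton(phi, y, np.array([1.0, 1.0]))
print(theta_hat)
```

Each iteration solves a linear LS problem of the kind treated earlier, with the Jacobian playing the role of the regressors. The step halving is a simple safeguard against overshooting; it does not guarantee convergence to the global minimum in general.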

To make things still more general, it is not necessary to use a quadratic measure 
of fit in (11.97). Let $\ell(\varepsilon)$ be a function that assumes positive values suitable to measure 
the “size” of $\varepsilon$ and take 

$$V_N(\theta) = \frac{1}{N}\sum_{t=1}^{N}\ell\left(y(t) - g(\varphi(t), \theta)\right), \qquad \hat\theta_N = \arg\min_{\theta} V_N(\theta) \tag{11.98}$$
A 


The problem (11.96) and (11.98) clearly is more general and more difficult than (II.7) 
and (II.8), but it certainly is in the same spirit. Gauss’s pragmatic interpretation of 
the LS criterion as a measure of fit of historic data applies equally well to (11.98). This 
way of selecting an estimate makes sense also without a probabilistic framework. 

11.4 PROBLEMS 

11.1 Show that 

$$\hat g(\varphi) = E[y \mid \varphi]$$

minimizes 

$$E[y - g(\varphi)]^2$$

with respect to g [see (II.1) and (II.2)]. Hint: Add and subtract $\hat g(\varphi)$ in the latter 
expression. 


11.2 Let $\rho(a, b)$ be the correlation between the random variables a and b: 

$$\rho(a, b) = \frac{\mathrm{Cov}(a, b)}{\left(\mathrm{Var}(a)\cdot\mathrm{Var}(b)\right)^{1/2}}$$

Let $\hat g(\varphi)$ be defined as in Problem 11.1. Show that 

$$\rho(y, \hat g(\varphi)) \ge |\rho(y, g(\varphi))|$$

for any function $g(\varphi)$ [Rao (1973), Section 4g.1]. 
11.3 Lyapunov’s central limit theorem states: “Let 

$$Z_N = \sum_{k=1}^{N}\alpha(k, N)w(k)$$

where $\{w(k)\}$ is a sequence of independent random variables with 

$$Ew(k) = 0, \qquad Ew^2(k) = \lambda_k, \qquad E|w^3(k)| = \gamma_k$$

Assume that 

$$\lim_{N\to\infty}\sum_{k=1}^{N}\alpha^2(k, N)\lambda_k = \lambda_0, \qquad \lim_{N\to\infty}\sum_{k=1}^{N}|\alpha(k, N)|^3\gamma_k = 0$$

Then 

$$Z_N \in \mathrm{AsN}(0, \lambda_0)$$”

Use this result to prove that the estimate (11.10) is asymptotically normal under suitable 
assumptions on $\{w_0(t)\}$. 